ABSTRACT

Progress on the Aerospace Intelligence Data System (AIDS)
Research Contract (AF 19(626)-10) is presented. The one-year effort
is summarized and the progress and status of the fourth quarterly
period of this effort are reported. The design philosophy of an
adjunctive equipment group (AN/GYA-( )) to be employed in developing
an "on demand" computer service capability with a large data
processor is described. A technical paper evaluating search time
bounds for various probability distributions of the file location of
stored information and several search strategies is presented. The
theoretical basis is established, in this report, for a procedure of
sampling files in order to estimate (with confidence levels obtained)
the number of filed items having some specified characteristics in
common. A methodology (computer oriented, using citation data) for
establishing "clusters" of related items is reported.

AEROSPACE INTELLIGENCE DATA SYSTEM (AIDS)

QUARTERLY  REPORT  NO.  4 

SECTION  I 

Introduction  and  Summary 

Applied research work performed under Contract AF 19(626)-10
is guided by the Work Statement dated 20 March 1961, as revised on
8 August 1961. Four technical areas of effort are identified therein,
for which applied research effort of either a reporting or an
execution nature is prescribed. These technical areas and
corresponding categories of effort are identified and reported on in
this quarterly report. The objective of this contractual effort, as
stated in the work statement, is: "The AIDS Applied Research Program
... has the overall goal of supporting the progress towards a SAC
operational sub-system by reporting on results of non-numerical data
automation techniques and equipment development now in progress in
industry during the time period of AIDS. Selected implementation of
these results will assure an optimum data handling equipment complex
and utilization of the most efficient techniques in the SAC
operational intelligence data handling subsystem."

Inasmuch as this report represents the fourth report of a
continuing research effort, a summary of the past effort and
accomplishments on each task, since the project's inception, is
provided in addition to reports on the work of the last quarter.

The  general  organization  of  this  report  is  as  follows: 

Volume  1 

(1)  One  year  review  and  last  quarter  progress  and 
status  summary 

(2) Appendix A: The General Design Philosophy of the
AN/GYA-( )

(3)  Appendix  B:  Optimum  Search  Strategies 

(4)  Appendix  C:  Sampling  for  Co-Occurrence 

(5)  Appendix  D:  An  Experimental  Investigation  of 
"Clustering" 

Volume  2 

(1)  Test  of  the  Datacom  Model  408-2  (to  be  submitted 
upon  completion  of  tests) 




SECTION  II 


Summary  Report 

Task  I  -  Automatic  Print  Reading 
General  Comment: 

This task has consisted of monitoring the progress being made
on automatic methods of converting printed text into machine
processable form.

Results: 

A complete bibliography of "Character Recognition System
Patents" was provided in the second quarterly report (Volume II).
This bibliography additionally included many patents that pertained
to elements of character recognition systems as well as patents
pertaining to allied fields such as facsimile, curve following,
signal analysis, code and perforated record analysis.

Two research efforts on new and promising approaches to
character recognition were reported in some technical detail. Verbal
presentations and a demonstration were provided of a semi-automatic
method of rapid and economical conversion of textual material to
machine processable form by means of a developed "Stenowriter"
system.

Task II - Storage and Addressing
General  Comment: 

This work included conceptual and design efforts for exploiting
the high storage densities and unique addressing features of the Air
Force's AN/GSQ-16 equipment in conjunction with a large scale data
processor, for the purpose of providing: "on demand" computer
service to intelligence analysts, greatly increased assistance to
analysts in formulating queries, and development of methods of
semi-automatic or automatic pre-processing of messages or text
into machine processable form. Sub-tasks include:

(1)  Photostore  Disc  storage  of  "quick  demand"  programs 

(2)  Searching  of  formatted  files 

(3)  Processing  of  unformatted  file  materials 

Task  II  -  Sub-task  1:  Disc  Storage  of  Quick  Demand  Programs 
Results: 

Three alternative engineering approaches for connecting a
Photostore (AN/GSQ-16) to an IBM 7090, involving a minimum of
computer interference or "down-time", have been examined. The
resulting recommendation is that one of these designs, a direct
data channel connection, be implemented.

The general design philosophy of a recommended effort
(consummated in contract AF 30(602)-2754) for the AN/GYA-( )
Computer Storage Integration Group has been evolved, and is
outlined in Appendix A of this report.




Task  II  -  Sub-task  2:  Searching  of  Formatted  Files 
General  Comment: 

The  problem  of  searching  files  of  information  to  answer 
specific  requests  has  been  examined  from  a  general  theoretical 
approach  with  results  being  applicable  to  many  equipment  and  file- 
content  configurations. 

Results: 

(a) A fundamental mathematical relation expressing
minimum average file search time in terms of the
probability of location of the "sought-for" file material
has been found.

(b) An important new technique for solving the foregoing
mathematical expression for minimum average file search
time has been developed.

(c) As a result of (a) and (b), the optimal search strategy
for a random access file mechanism having material
filed within it in an ordered manner has been determined.
Similarly, for this case the corresponding minimum
average search time has been determined.

(d) A mathematical method of sampling files for the
possible co-occurrence of specified terms or events
has been developed which should have considerable
utility in "on-demand" multiprocessing information
retrieval systems which employ rapid access, large
capacity peripheral files.
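
To make the dependence of average search time on location probability
in result (a) concrete, the following is a minimal sketch under stated
assumptions. It is not the relation derived in Appendix B. It assumes
that locations are probed one at a time, that every probe costs one
unit, and that the location probabilities are known; the function
names and the example probabilities are hypothetical.

```python
from typing import List, Sequence


def expected_probes(probabilities: Sequence[float], order: Sequence[int]) -> float:
    """Average number of probes when locations are examined in `order`.

    probabilities[i] is the (assumed known) chance that the sought-for
    item is stored at location i; each probe is assumed to cost one unit.
    """
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(
        (position + 1) * probabilities[location]
        for position, location in enumerate(order)
    )


def best_order(probabilities: Sequence[float]) -> List[int]:
    """Probe the most probable locations first.

    Under the unit-cost assumption this ordering minimizes the average.
    """
    return sorted(range(len(probabilities)), key=lambda i: -probabilities[i])


if __name__ == "__main__":
    p = [0.1, 0.6, 0.3]  # hypothetical location probabilities
    print(expected_probes(p, [0, 1, 2]))       # probing in storage order
    print(expected_probes(p, best_order(p)))   # probing by decreasing probability
```

Probing in order of decreasing probability never does worse than any
other fixed order in this simplified model; the ordered, random access
case treated in result (c) needs the fuller analysis of Appendix B.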

Progress  and  Status  this  Quarter: 

Dr. Eugene Wong has continued his development of a mathematical
basis for determining optimum search strategies. He has examined
in detail the problem of determining efficient techniques for
locating items stored in tabular (ordered) format in computer
storage units. The resulting search time bounds are reported in
Appendix B, along with suggested avenues of research leading to
practical applications of these "bounds".

Dr. C. Abraham has further developed the mathematical basis
for sampling a file for items having a co-occurrence of, say, a
descriptive term or associated event. This work is of particular
significance for "on-demand" multi-processing computer complexes
where some "feedback" to the analysts making the queries must be
provided concerning the estimated magnitude of file response to be
obtained. Such "feedback", based on "sampling" rather than on
exhaustive file search, can obviate unnecessarily time consuming
searches and allow the analyst to narrow the search if he so desires
by adding qualifying search terms with logical conditions pertaining
to their relationships. This recent work is described in Appendix C
along with associated work on the correlation task.
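
The kind of sampling "feedback" described above can be sketched as
follows. This is an illustration only and is not the procedure of
Appendix C: it assumes simple random sampling without replacement and
a textbook normal approximation for the confidence interval. The
function name, its parameters, and the example file are hypothetical.

```python
import math
import random
from typing import Callable, Sequence, Tuple


def estimate_match_count(
    file_items: Sequence[object],
    matches: Callable[[object], bool],
    sample_size: int,
    z: float = 1.96,  # about 95% confidence under a normal approximation
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Estimate how many filed items satisfy `matches` from a random sample.

    Returns (estimate, lower bound, upper bound) for the whole file.
    """
    n_total = len(file_items)
    if not 0 < sample_size <= n_total:
        raise ValueError("sample_size must be between 1 and the file size")
    rng = random.Random(seed)
    sample = rng.sample(list(file_items), sample_size)
    hits = sum(1 for item in sample if matches(item))
    p_hat = hits / sample_size
    # Standard error of a proportion, with a finite-population correction
    # because the sample is drawn without replacement.
    fpc = (n_total - sample_size) / (n_total - 1) if n_total > 1 else 0.0
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / sample_size * fpc)
    low = max(0.0, p_hat - z * std_err)
    high = min(1.0, p_hat + z * std_err)
    return p_hat * n_total, low * n_total, high * n_total


if __name__ == "__main__":
    # Hypothetical file: each item is the set of terms indexed for it.
    file_items = [{"radar", "site"} if i % 4 == 0 else {"radar"} for i in range(1000)]
    print(estimate_match_count(file_items, lambda t: {"radar", "site"} <= t, 100))
```

An analyst shown a large estimate could then add qualifying terms
before committing the machine to an exhaustive search.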

Task  II  -  Sub-task  3:  Processing  of  Unformatted  File  Material 
General  Comment: 

Examination of this problem has led directly to the central
problem of information retrieval; that is, "what is the best form
into which one should translate (or characterize) unformatted source
material, for machine searching?" and secondarily, but important
in an economic sense, "How can unformatted text be converted (or
processed) to this form?".

Work supported by AF 49(638)-1062 has provided an asymptotic
expression for machine search time for a large file of retrievable
items, searchable by a query made up of terms in conjunctive and
disjunctive form. This search time is proportional to the square of
the average number of retrievable items in the file that pertain to
each description term (non-negated) that occurs in the query
(reference AFOSR-TN-61-2).

Dr. Bohnert of the Experimental Systems Department, IBM
Research, has been studying with some success transformations
between pseudo-English language sentence structure and equivalent
symbolic logic expressions. Additionally, J. Andress of this
Department has completed a powerful program-tool for generating
sentences from stated rules of grammar (to be presented at the
International Congress of Linguists, MIT, 28 August 1962, by Lees,
Matthews and Andress). This program should allow penetrating
analysis by linguists of the adequacy of hypothesized grammatical
rules, through examination of large quantities of sentences generated
by these rules.
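
The idea of generating sentences from stated rules of grammar can be
illustrated with a minimal sketch. It is not the program written by
Andress; the toy grammar, the symbol names, and the function name are
all assumptions chosen for the example.

```python
import random
from typing import Dict, List

# Hypothetical toy grammar: each symbol maps to its alternative expansions.
GRAMMAR: Dict[str, List[List[str]]] = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "N": [["analyst"], ["report"]],
    "V": [["reads"], ["arrives"]],
}


def generate(symbol: str, rng: random.Random) -> List[str]:
    """Expand `symbol` by recursively choosing one rule at random."""
    if symbol not in GRAMMAR:  # a terminal word, emitted as-is
        return [symbol]
    words: List[str] = []
    for part in rng.choice(GRAMMAR[symbol]):
        words.extend(generate(part, rng))
    return words


if __name__ == "__main__":
    rng = random.Random(1962)
    for _ in range(3):
        print(" ".join(generate("S", rng)))
```

Reading through many generated sentences shows quickly where a rule
set is too permissive; this toy grammar, for instance, has no rule
preventing "the report reads the analyst".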

Progress  and  Status  this  Quarter: 

As  a  result  of  the  above  cited  current  and  pending  investigations 
(including  a  statistical  study  of  texts  by  the  IBM  San  Jose  group)  which 
are  attacking  fundamental  aspects  of  English  to  English  translation,  a 
minimum  of  effort  was  expended  on  this  task  of  the  project  effort. 

Task III - Multiple Interrogation
General  Comment: 

The basis of this effort is the premise that large computer
complexes which are intended in part to aid intelligence analysts in
accomplishing their functions should be tailored to provide a
convenient, timely, and usable service rather than serving entirely
as large scale "batched problem" processors whose "throughput" rather
than general utility is considered paramount. Two major divisions
of non-hardware effort are identifiable:

(1) Programming research for multiprocessing

(2) Query formulation programming

Each of these "software" efforts is intended eventually to be
integrated with, and to employ, the Computer Storage Integration
Group AN/GYA-( ) as outlined in Appendix A.

Results: 

An  overall  technical  plan  has  been  prepared  for  a  combined 
hardware  and  software  program  leading  to  a  demonstrable  capability 
of  mixed  simultaneous  operation  of  a  routine  computer  task  and  a 
user  query  formulation  and  file  search  (see  Quarterly  Report  No.  2 
Volume  III  and  Appendix  A  of  this  report). 

The outline of the programming approach and structure that
is being pursued is as follows:

(1) development of an advanced supervisory control
program with complete accounting lists of the status of
each program being worked on (or requested), and

(2) quick reference to program segments stored on the
photostore as an adjunct to use of the main core memory
for operating program storage.

In addition, the first form of a query formulation procedure and
program has been nearly completed and de-bugged. This program
stresses the interplay between the computer and the user (at a
suitable display console). "One-shot" query preparation and
extensive waiting for an answer will be avoided by computer
provision of partial responses and possible alternative question
sequences which are posed to the user as he develops his query in a
knowledgeable and natural manner.

Progress  and  Status  this  Quarter: 

The query simplification program in a simple outline form
has been completed and is currently being de-bugged. The program
in its initial form presents a sequence of display instructions to the
person making the query (as follows):

Display Title                       Comment

(1) Enter Field Name                User is asked to select an
                                    appropriate "Field" or sub-topic
                                    from the total computer stored
                                    data.

(2) Specify Parameter Match         The user is given the opportunity
    Criterion                       to specify query match conditions
                                    such as "equal to", "greater
                                    than", "less than", "less than or
                                    equal to", "greater than or equal
                                    to" and "not equal to".

(3) Enter search parameters         An opportunity is here provided
                                    to insert terms, names, and
                                    characteristics that pertain to
                                    the query. This is equivalent to
                                    the "inclusive or" retrieval of
                                    all Field items containing these
                                    terms or characteristics.

(4) Expand, Narrow, or End          This provides an additional
    Query                           opportunity to expand the file
                                    search by adding additional
                                    subject fields and requesting
                                    search on an "inclusive or" or a
                                    "conjunctive (logical
                                    intersection)" basis.

This work will continue to be expanded upon this program outline
with the demonstration of its utility as the primary goal.
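
The match criteria and the "inclusive or" / "conjunctive" combination
named in the display sequence above can be sketched as follows. This
is an illustration, not the program under development: the record
layout, the function names, and the spelling of the criteria as
dictionary keys are assumptions.

```python
import operator
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Match criteria offered at the "Specify Parameter Match Criterion" step.
CRITERIA: Dict[str, Callable[[Any, Any], bool]] = {
    "equal to": operator.eq,
    "not equal to": operator.ne,
    "greater than": operator.gt,
    "less than": operator.lt,
    "greater than or equal to": operator.ge,
    "less than or equal to": operator.le,
}

Condition = Tuple[str, str, Any]  # (field name, criterion, search parameter)


def holds(record: Dict[str, Any], condition: Condition) -> bool:
    """True when the record has the field and the criterion is satisfied."""
    field, criterion, value = condition
    return field in record and CRITERIA[criterion](record[field], value)


def search(records: Iterable[Dict[str, Any]],
           conditions: List[Condition],
           combine: str = "or") -> List[Dict[str, Any]]:
    """Return records meeting any ("or") or all ("and") of the conditions."""
    if combine not in ("or", "and"):
        raise ValueError("combine must be 'or' or 'and'")
    pick = any if combine == "or" else all
    return [r for r in records if pick(holds(r, c) for c in conditions)]


if __name__ == "__main__":
    files = [{"name": "A", "year": 1959}, {"name": "B", "year": 1961}]
    print(search(files, [("year", "greater than", 1960)]))
```

Narrowing a query corresponds to re-running the search with "and"
after adding a condition; expanding it corresponds to "or".
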

The Advanced Supervisory Control Program (overall executive
control program) to control the multiprogrammed operations on
the AN/GYA-( ) - IBM 7090 integrated equipment has been outlined
in scope, and some detailed programming on key portions of it has
begun.

Task  IV  -  Correlation  and  Classification  Techniques 
General  Comment: 

The general intelligence problems characterized by the
following questions are being examined: "What attributes or
characteristics are really important with respect to the item or
problem in which one is interested?" or "Do these particular
measured characteristics signify this (or that) overall class of
actions or structures?", or "If this person or thing pertains to
something of possible interest are these other people or things also
involved?"

Results: 

Classification: Early in this program, basic work was accomplished
that led to an FSD developed FORTRAN program for determining
"into which of two categories or classes an item (having up to 40
measured characteristics) should be placed". This work has been
extended to obtain methods of determining a confidence measure for
such classification action even though it is based on an extremely
small sample of the population classes into which one is attempting
to place the new item.
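
The report does not state the decision rule used by the FORTRAN
program. Purely for illustration, the sketch below substitutes a
simple nearest-mean rule for the two-class decision; the function
names and the sample measurements are hypothetical, and no confidence
measure is computed.

```python
import math
from typing import List, Sequence


def mean_vector(samples: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of the measured characteristics of one class."""
    count = len(samples)
    return [sum(column) / count for column in zip(*samples)]


def classify(item: Sequence[float],
             class_a: Sequence[Sequence[float]],
             class_b: Sequence[Sequence[float]]) -> str:
    """Assign `item` to the class whose mean is nearer (Euclidean distance).

    This nearest-mean rule is an assumed stand-in, not the report's method.
    """
    dist_a = math.dist(item, mean_vector(class_a))
    dist_b = math.dist(item, mean_vector(class_b))
    return "A" if dist_a <= dist_b else "B"


if __name__ == "__main__":
    known_a = [[0.0, 0.0], [1.0, 1.0]]      # hypothetical class A samples
    known_b = [[10.0, 10.0], [11.0, 11.0]]  # hypothetical class B samples
    print(classify([0.5, 0.2], known_a, known_b))
```

With only a few samples per class, as in the situation described
above, the class means are themselves uncertain, which is why a
confidence measure for the decision is of interest.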

Correlation: Three experiments were completed in the process of
developing techniques of "clustering" or determining fundamental
relationships among items (people, documents, R & D activity,
suspected military installations, etc.) on the basis of measurable
characteristics of these items which do not obviously nor directly
yield a possible or probable "connection" between them. In addition,
as reported in Quarterly Report No. 3 (Volume I), Dr. Abraham
reviewed and mathematically analyzed the appropriateness of use of
various "association measures".

An experiment that delineates and assesses a new concept of
dissemination of information, based upon categorizing groups of
report recipients on the basis of a high probability of common
reading interest, was reported in Quarterly Report No. 1.

Progress and Status this Quarter:

Miss Reisner has continued the work on methodologies of
"clustering", that is, grouping of items together on some basis of
common relationship. Since the co-authorship relationship by
itself appeared to provide a highly fragmented clustering effect,
experimental effort was changed to the use of citations (literature)
so that results could be correlated more obviously with known facts.
Of particular significance is the devising of a "path-tracing"
algorithm to determine "connectivity" of items of interest.
Appendix D of this report describes the work of this quarter on
"clustering".
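
One simple reading of "path-tracing" for "connectivity" is to group
items that can be reached from one another through a chain of
citations. The sketch below shows that reading only; the algorithm of
Appendix D may differ. Links are treated as undirected, and the
function name and example data are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def clusters(links: Iterable[Tuple[Hashable, Hashable]]) -> List[Set[Hashable]]:
    """Group items into clusters connected by any path of links.

    Each link (a, b), e.g. "a cites b", is treated as undirected.
    Clusters are returned largest first.
    """
    neighbors: Dict[Hashable, Set[Hashable]] = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen: Set[Hashable] = set()
    result: List[Set[Hashable]] = []
    for start in neighbors:
        if start in seen:
            continue
        group = {start}
        seen.add(start)
        queue = deque([start])
        while queue:  # trace every path leading out of `start`
            item = queue.popleft()
            for nxt in neighbors[item]:
                if nxt not in seen:
                    seen.add(nxt)
                    group.add(nxt)
                    queue.append(nxt)
        result.append(group)
    return sorted(result, key=len, reverse=True)


if __name__ == "__main__":
    citations = [("A", "B"), ("B", "C"), ("D", "E")]  # hypothetical citation pairs
    print(clusters(citations))
```

A relationship that links few items, as co-authorship was found to,
produces many small clusters under this grouping.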

Other: 

In November 1961 a Facilities Contract was executed with
the Electrada Corporation for an off-line editing and message
composing keyboard and display console. This equipment was delivered
to the Thomas J. Watson Research Center on 9 July 1962 where it is
presently undergoing test. Volume II of this fourth quarter report
will be submitted to cover results of these tests.

Appendix A

Overall Design Philosophy of the AN/GYA-( ) Computer Storage
Integration Group


Introduction: 

Large scale computer installations typified by the IBM 7090
have been profitably exploited by many organizations in the processing
of scientific, engineering, business, and other information. Until
recently only a small fraction of the total personnel in a typical
organization have been able to directly utilize these machines, albeit
in a highly efficient manner. The advent of symbolic programming,
source problem programming languages, and compilers has
extended the power of these processing complexes to the service of
many more people. However, in the realm of information processing
where human mental activity needs continuous assistance, rather
than intermittent assistance or replacement, computers are not used
to the extent that the potential of some of their characteristics
would indicate should be possible. The primary deterrent to more
complete exploitation of machine information processors would
appear to be the essentially unilateral and discontinuous
time-batching method by which men are forced to communicate and
operate with computers, even when the language and procedures of
this communication are completely familiar. This is of course
the result of unwillingness of most reasonable managements to
allow "tying up" of the totality, or even large portions, of these
powerful processors, either on a time or equipment basis, while one
man "thinks" about his next request or response to the machine. The
need for discontinuous but "on demand" use of a computer is keenly
felt by programmers when they attempt to "de-bug" new programs.
This need is further appreciated by scientific and engineering
personnel who frequently would like a more continuous interplay with
the machine during the processing of a problem even though this
interaction is interspersed with pauses in machine activity on that
problem to allow human thought and possible redirection of machine
activity (see for example reference 1). Similarly, in mixed machine
and human information processing, typified by military intelligence
analyst activity, there would appear to be a great need for this
discontinuous but "on demand" service by a computer complex.

It is extremely doubtful, except for operational situations
involving a few major specific functions, that complete dedication of
a large processor to this "on demand" service is warranted. This
would be particularly true if the valuable services now provided to
an organization on an efficient basis were seriously curtailed. The
development program on the AN/GYA-( ) is directed to obtaining a
compromise equipment configuration, useful in an intelligence
information processing environment, in which the "bread and butter"
"batched processing" tasks of the large scale processor are affected
only slightly while "on demand" service is also conveniently and
efficiently provided.

Postulation  of  Processing  Environment: 

The  postulation  of  an  intelligence  information  problem 
environment  determines  that  existing  processing  needs  include 
operation  on  numerical  computational  problems,  logical  and  lexical 
operations  on  information  in  language  form,  and  retrieval  of 
information  items  from  large  scale  storage,  as  well  as  problems 
having mixtures of all three processing categories. Indeed, if
processing problems could be uniquely categorized, then specialized
hardware dedicated to each type of problem would be a feasible and
possibly economical approach to the goals of this endeavour.
However, the following type of problem is presented as an example
of a desirable type of service that must be provided: an analyst (let
us say a political specialist) enters into the computer complex ten
names  that  have  come  to  his  attention  by  reference  to  some  recent 
literature.  He  asks  that  all  available  news  items  (perhaps  in 
abstracted form) relative to these people be provided. The processor
replies, via a suitable display, that approximately 500 newspaper
articles are involved. The analyst decides, after some thought, that
he would be satisfied initially to have all items which are source
dated after 1 January 1960. He is informed of the number of items
in the file response, or at least whether this number is greater
than, say, that which can be conveniently handled by the display
screen. Let us say it isn't, and he requests "display", rather than
hard copy print-out. After examining the screen display he decides
that two of these
people  appear  to  have  similar  knowledge  on  a  subject  of  interest  to 
him.  He  requests  from  a  particular  file  the  biographical  data  on 
these  men  and  selects  from  this  as  well  as  from  the  previous  news 
items  a  set  of  criteria  that  can  be  used  to  determine  possible 
relationships  between  these  people  and  others  who  may  be  noted  in 
the  file.  He  may  select  for  example  specific  "location-time  intervals" 
from  the  biographical  data  and  then  ask  the  computer  to  retrieve  from 
a  "location-time"  organized  event  file,  all  items  listed  under  the 
intervals that he selected, specifying perhaps some logical
relationship of co-occurrence of an action or event within each
retrieved "location-time" interval. At this point some internal
sampling procedure should be carried out by the computer
in order to place a bound on the computer "tie-up" on any specific
request. After the computer informs the user of the estimated
magnitude  of  items  to  be  handled  as  a  result  of  his  request,  the  user 
may  instruct  the  computer  to  proceed  with  a  clustering  matrix  or 
"graph  tracing"  operation  which  then  provides  him  with  a  grouping  of 
names,  each  group  having  a  decreasing  degree  of  "connectivity"  by 
virtue  of  their  relationship  reported  in  the  event  file.  At  this  point 
the  analyst  requests  a  file  time  and  CPU  time  accounting  of  his  use 
of  the  complex  and  decides  to  ask  for  a  print-out  so  that  he  can 
submit  a  "batched"  request  for  complete  biographical  and  news  item 
print  out  of  either  the  pertinent  document  references  or  the  actual 
source  material,  so  that  he  may  assess  the  reasons  or  "common 
purposes"  that  connect  the  groups  of  people  (if  such  a  common 
purpose  exists). 

The above description, although fictitious, is believed
realizable and typical of service that is needed in many human
information processing environments. The question of course now
arises "So it can be done, what happens to the processing of the
daily organizational reports; or worse, the payroll?" The answer
to this question involves consideration of processor throughput in
terms of the number of concurrent "on demand" users and the
statistical characteristics of their queries in terms of processing
load, file search load, input/output operation load, and so forth.
We obviously do not know these factors, and presently have no
reasonable means of assessing them quantitatively, until such a
service is provided at least in development form. Accordingly we
have arbitrarily set some initial criteria for our research and
development program to provide "bounds" on what is done to assess
this problem area:

(1)  The  characteristics  of  the  central  processor  (IBM  7090), 
at  least  initially,  will  be  unmodified  per  se. 

(2) "De-bugged" programs only will be allowed within the
processor during the "on-demand" service experimentation. (Later
in the program, hardware modifications for memory allocation and
protection will be examined in order to provide greater flexibility in
this respect.)

(3)  More  than  one  concurrent  "on  demand"  user  will  be 
hypothesized  for  the  development  program  in  order  to  provide  for 
generalizing  the  usefulness  of  experimental  results. 

(4)  Assessment  of  expected  impact  on  "bread  and  butter" 
program  operation  in  terms  of  "on  demand"  users  will  be  a  goal  of 
the  program. 

Processor  "Throughput"  Discussion: 

Obviously any specified computer equipment configuration
has limits as to the number of "on demand" programs that it can
execute in a specified length of time. The portion of the computer
complex that fixes such a limit will depend on the predominant
characteristics of the processing that the problem-mix requires.
The following possibilities exist as unique or joint occurrences in a
central processor type computer.

Predominant Characteristics          Possible Computer-System
of Processing Required               Limitations

1. Extensive computations.           Basic core memory and CPU cycle
                                     time of a central computer.

2. High Input/Output (I/O) data      I/O transfer cycle time
   flow rates.                       availability.

3. Frequent references to            External-storage average item
   peripheral storage devices.       access time and availability of
                                     peripheral storage capacity.

4. Extensive transfer within         High-speed memory capacity or
   programs to sub-routines          average item access time of
   and other non-contiguous          external storage.
   program segments.

5. Extensive non-contiguous          High-speed memory capacity or
   data-block branching.             average item access time of
                                     external storage.

6. Many I/O device to I/O            I/O data transfer rate or I/O
   device data transfer              transfer cycle time availability.
   operations.

The above list pertains to a single central processor of a
serial processing type (with overlap of input/output and CPU
operations allowed). The list for distributed systems, that is,
systems with multiple or distributed processors (or component
processing units), would be different, with such limiting factors
appearing as "intra-computer switching and communication rates" or
"supervisory program operation rates".

Although quantitative assessments cannot be realistically
made without experimental trial of the actual operation of an "on
demand" computer service, qualitative assessments are necessary
to produce a design approach for such experimentation. Review of
the above list, keeping in mind an intelligence environment for
operation, has allowed the following guidelines to be drawn:

(1)  Extensive  logical  operations  or  computations:  This 
problem  characteristic  is  more  typical  of  scientific  problems  and 
would  not  be  expected  as  a  predominant  characteristic  of  the  "mix" 
of  intelligence  processing  problems  (such  problems  will  occur  of 
course). 

(2) High Input/Output (I/O) data flow rates: If the "on-
demand" processing service were to be provided to a multiplicity of
high data rate devices such as radar sets or multiplexed
communication data channels, then I/O transfer cycle time could
become saturated. Generally speaking, if "on demand" computer
service is to be provided primarily to human beings, the superimposed
data transfer rates required for communications from these human
beings would be negligibly small. It is true however that unless
some measure of internal control is provided, a few "on demand"
queries for information could unnecessarily cause a saturation of
available I/O channels if they required total file contents to be
scanned by the central processor unit for matching query criteria.
This is a problem that must be circumvented in an "on demand"
system design.

(3) Frequent references to peripheral storage devices: This
problem-processing characteristic is typical of information retrieval
application, input-data transformation to machine processable form,
and processing that involves frequent references to pre-computed
data tables or tables of constants. All of these processing
characteristics would be typical of intelligence and other intellectual
information processing environments.

(4) Extensive transfers within programs to subroutines and
other non-contiguous program segments: This problem does not
appear seriously in the normal "batched processing" use of a
computer (except in some compiler operations), since large
contiguous segments or even total programs are usually brought
into main high speed core memories for rapid reference. In this
case, overlapping operations of several programs can be
efficiently accomplished, and the occasional reference to a
special macro-program or subroutine via I/O operations need not
produce great loss in computer throughput. If, however, "on
demand" service to many users, each with different program
requirements, is considered, the capacity of even a large high
speed core memory may be exceeded. Teager and Morse (see
reference 2), working toward similar "on demand" computer
service goals in an engineering and scientific environment, have
noted that greater high speed memory capacity could be
effectively provided for processing purposes if the amount of
high speed memory required to contain the programs could be
reduced. Their preliminary assessments showed that over 20% of
the machine time of a particular computing complex goes to
programs which either take more than 16,000 words of (core)
memory or require more than 15 minutes of running time. Their
analysis further indicated that on the average only three or
four programs can be run simultaneously in a 32K word core
memory machine, and that if the alternative of processing each
program on a priority sequence basis were followed,
unacceptable delays would be encountered by individual users.

(5) Extensive non-contiguous data-block branching: This type
of problem arises in many instances of processing data, each
item of which requires a processing reference to a data item not
contained in a contiguous block of data held in the high speed
core memory. Processing of information that is heavily "cross
linked" with other items in a total file, linguistic operations
such as synonym and word-phrase linking, and retrieval of
information based on "chain" storage, where relevant items are
connected by "see also" or "see next" linkages, all lead to
extensive data branching. Here again, as in the frequently
branching or the highly segmented type of program problem,
limitation will be imposed either by the ability of the high
speed core memory to hold all pertinent data for all the
programs, or by the time consumed in finding information in
external storage.

(6) Many I/O device-to-I/O device data transfer operations:
An example of this type of processing problem might be the
transfer of large quantities of digitally stored information
from, say, tape units, through the CPU for an editing process,
to an output printer. This type of operation is a prevalent one
in large processing complexes and frequently is assigned to
relatively independent smaller scale peripheral data processors.
In any event, many of these operations occur in problems
susceptible to "batched" processing operations, and are
frequently handled that way.

Computer  Through-Put  Considerations: 

In reviewing the foregoing categories of probable processing
characteristics of problems in an intelligence operation, it
would appear that categories (3), (4) and (5) require special
attention in the design of an "on demand" computer service. Item
(3), frequent references to peripheral storage, in particular
may be affected not only by the problem environment (through
problem characteristics (4) and (5)) but also by methods of
achieving "on demand" service from the computer complex. Since
the amount of high speed core memory appeared to be an important
possible limiting factor on the number of contiguous programs
(and associated data blocks) that can be operated upon
"concurrently" (on a dense time multiplexed basis), methods that
could effectively augment the available high speed memory were
explored. Direct supplementation of the high speed core memory
is of course possible, and if core memory augmentation factors
of up to two were all that were required, this would be
perfectly feasible by engineering addition to the computer of a
parallel memory unit. However, high speed read/write memories
are the most expensive form of computer memory for storage of
sparsely used data, on the order of one dollar per bit in an
operational set-up. An alternative was to seek a peripheral
storage device in which the complete program repertoire of the
computer installation would be contained, and from which needed
portions would be available for quick transfer into the main
core memory.
Now the amount of storage capacity needed for the routine
program repertoire of, say, an IBM 7090 in a scientific
computing center may be as low as one to two million bits; this
would be the case where a great majority of the operating
programs are maintained by the various users, external to the
computer center, and where no "on demand" service is provided.
On the other hand, cursory review of the total published library
of programs for the IBM 704 system (704 Share Abstracts, May 15,
1960), representing programs developed for a great variety of
customer applications in the scientific area, indicates that a
versatile program library may fall in the ten million bit
category, or greater. It would seem logical to assume that an
"on demand" computer service should provide for the use of a
wide variety of program routines, and that peripheral storage
capacity on the order of $10^7$ bits should be a developmental
criterion if program repertoire storage only, exclusive of
other functions, were to be considered.

The IBM 7090 can execute on the order of $2 \times 10^5$ program
instructions per second (on a non-data-limited basis). Therefore
any scheme for effectively augmenting high speed memory capacity,
so that multiple programs can be executed on a dense time shared
basis, must consider the effect of retrieval and "read-in" of
the program instructions. Unfortunately, extensive statistical
data on the relationship between program segment size and the
average number of program transfers outside each such program
segment are not available. From research on instruction mixes
for IBM 7094 programs one would expect the probability density
(per word) of a sequential operating instruction being outside a
program segment whose size equals two instructions to be about
0.55, and this probability should eventually reach a value close
to the reciprocal of the program size for program segments that
approach average maximum program size. Review of the above
referenced Share program abstracts for the IBM 704 indicates
that special subroutines or essentially self-contained programs
(with the exception of compilers) average about 200 instruction
words each. Assuming that primarily the terminal instruction in
these routines involves transfer to an instruction outside a 200
word average set or segment, the probability density per word of
encountering a non-included instruction would be on the order of
0.01 for a 200 word program segment size.

With this probability of requiring retrieval of another program
segment from a peripheral storage device (assuming that each
additional segment needed does not also reside in the high speed
memory), a program segment access time of about a millisecond
(per segment) would be required for a computer that averages on
the order of $10^5$ instructions "consumed" per second. This
estimate is predicated on a program-segment retrieval operation
which does not appreciably lower the average useful program
instruction execution rate, and on the assumption that the
computer need not wait for one program segment in order to
continue execution of other program instructions. Several
practical factors tend to reduce the rate at which a program's
total instructions are "consumed". First, programming makes
extensive use of highly efficient reiterated "loops" or short
sequences of instructions, and this reduces the effective rate
at which the computer moves through a program's total of
instructions. Second, a certain amount of within-program
"housekeeping", involving, say, transfer of data within memory,
may further reduce the effective rate of total program
instruction consumption.

These factors may cause a program instruction "consumption
rate" (not instruction execution rate) to drop to about one
fifth or even to as little as one tenth of the basic instruction
execution rate of the computer (based on an average of 2 cycles
per instruction). If, and the speculative nature of this cannot
be overemphasized, the above assumptions are approximately
correct, then the program instruction "consumption rate" would
be about 20,000 per second, as averaged over the total program
"mix" of a computer complex. A program segment length selected
to be about 256 instruction words would lead to a probability
density per word that a transfer of instruction sequence outside
a segment will occur of about 0.01.

This estimate is based upon sparse data, and attempts to obtain
further information on this will be made.
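
The round figures quoted above can be retraced with a few lines
of arithmetic. The following is a minimal sketch in Python, not
part of the original study; the function and constant names are
illustrative assumptions, and the numbers are the text's own
order-of-magnitude estimates.

```python
# Back-of-envelope check of the throughput figures quoted in the text.
# Names are hypothetical; the constants are the report's round numbers.

EXECUTION_RATE = 2e5   # instructions executed per second (IBM 7090, per text)
P_OUTSIDE = 0.01       # probability per word of leaving the current segment


def consumption_rate(execution_rate, reduction_factor):
    """Rate at which *new* program instructions are 'consumed' per second,
    after loops and housekeeping cut the rate by reduction_factor."""
    return execution_rate / reduction_factor


def segment_fetches_per_second(p_outside, rate):
    """Expected number of out-of-segment transfers per second."""
    return p_outside * rate


def required_access_time(p_outside, rate):
    """Segment access time (seconds) that just keeps pace with the
    out-of-segment transfers at the given consumption rate."""
    return 1.0 / segment_fetches_per_second(p_outside, rate)


if __name__ == "__main__":
    # One fifth and one tenth of the basic execution rate.
    print(consumption_rate(EXECUTION_RATE, 5))
    print(consumption_rate(EXECUTION_RATE, 10))
    # Access time needed at 10^5 instructions consumed per second.
    print(required_access_time(P_OUTSIDE, 1e5))
    # Segment fetches per second at the reduced consumption rate.
    print(segment_fetches_per_second(P_OUTSIDE, 20000.0))
```

Under these assumptions the one-tenth case gives the 20,000 per
second figure used in the remainder of this section, and the
unreduced case gives the access time of about a millisecond.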

The  fraction  of  machine  capacity  that  should  be  dedicated  to 
"on  demand"  service  must  be  determined.  At  this  time  it  is 
impossible  to  predict  the  effect  of  segmentation  of  program  data, 
but  if  one  arbitrarily  sets  the  limit  that  less  than  10%  of  the 
program  instruction  consumption  rate  be  dedicated  to  the  "on  demand" 
service  then  the  following  constraints  should  hold  as  an  approximation: 


    $t_s = [P(L) \cdot I \cdot K_s]^{-1}$                          (1)

    $t_s = t_1 + t_2 + t_3$                                        (2)

where:

    $K_s$  = fraction of the program instruction consumption
             rate allocated to program segment retrieval.

    $P(L)$ = probability density per program instruction that
             an "outside" segment instruction will be required
             (segments L words long).

    $I$    = average program instruction word "consumption
             rate" (per second).

    $t_s$  = average total time required to obtain (in a main
             core memory position) a program segment of length
             L from peripheral storage.

    $t_1$  = average time required to address and locate a
             desired program segment in a peripheral storage
             device.

    $t_2$  = average time to read out a selected program
             segment into a buffer storage.

    $t_3$  = average time to read the selected program segment
             from the buffer to main high speed core memory.

Combining (1) and (2):

    $t_1 = \frac{1}{P(L) \cdot I \cdot K_s} - (t_2 + t_3)$         (3)

Assume the following:

    $K_s = 0.1$

    $L = 256$ program instruction words (36 bits each).

    $P(L) \approx 0.01$ transfers per program word "consumed".

    $I \approx 20,000$ instructions "consumed" per second.

    $t_2 \approx 0.009$ seconds ($1 \times 10^{-6}$ seconds/bit
    x 256 words x 36 bits/word). *

    $t_3 \approx 0.008$ seconds (256 words x $30 \times 10^{-6}$
    seconds/word). **

Then $t_1$ (the average time to locate a desired program segment
in the peripheral storage unit) becomes, from (3):

    $t_1 \approx 33$ milliseconds

 *  1.0 bit per microsecond is an assumed high speed peripheral
    storage data transfer rate.

 ** This assumes an average of 14 machine cycles per 36 bit word
    transferred via an I/O device.

Obviously, if $t_1$ and $t_3$, that is, the time to address and
locate a program segment and the time to read it out of buffer
storage, can be overlapped, then (3) becomes (assuming
$t_1 > t_3$):

    $t_1 = \frac{1}{P(L) \cdot I \cdot K_s} - t_2$

    $t_1 \approx 41$ milliseconds

Alternatively, in this latter case, if the addressing time
($t_1$) is maintained at 33 milliseconds, then the fraction of
machine cycles used for "on demand" service can be upped from
10% to 14%.
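
The access-time budget of Eq. (3) can be retraced numerically.
The sketch below is an illustration only: the function name
`locate_time_budget` and its parameters are assumptions
introduced here, and the input values are the ones assumed in
the text.

```python
# Sketch of Eqs. (1)-(3): how much time remains for addressing and
# locating a segment (t_1) once read-out (t_2) and buffer-to-core
# transfer (t_3) are accounted for. Names are hypothetical.

def locate_time_budget(p_outside, consumption_rate, k_s, t2, t3,
                       overlapped=False):
    """Return t_1 in seconds.

    p_outside        -- P(L), out-of-segment transfers per word consumed
    consumption_rate -- I, instructions consumed per second
    k_s              -- fraction of consumption rate given to retrieval
    t2, t3           -- read-out and buffer-to-core times in seconds
    overlapped       -- if True, t_3 is hidden behind t_1 (needs t_1 > t_3)
    """
    t_s = 1.0 / (p_outside * consumption_rate * k_s)   # Eq. (1)
    if overlapped:
        return t_s - t2
    return t_s - (t2 + t3)                             # Eq. (3)


# Values assumed in the text.
T2 = 1e-6 * 256 * 36      # 1 bit per microsecond, 256 words of 36 bits
T3 = 256 * 30e-6          # about 30 microseconds per word via I/O

if __name__ == "__main__":
    serial = locate_time_budget(0.01, 20000.0, 0.1, T2, T3)
    overlap = locate_time_budget(0.01, 20000.0, 0.1, T2, T3, overlapped=True)
    print(round(serial * 1000), "ms without overlap")
    print(round(overlap * 1000), "ms with t_1 and t_3 overlapped")
```

With the unrounded values of $t_2$ and $t_3$ the two cases come
out at roughly 33 and 41 milliseconds, in agreement with the
figures above.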

If auxiliary program storage alone functionally dictated the
desired characteristics of an auxiliary storage device for use
in providing an "on demand" user service, the aforementioned
storage characteristics could be satisfied by technology which
is already in an advanced development or operational status
(say, by magnetic drum storage devices). One should expect,
however, that the developmental efforts toward providing such
service should anticipate that each user will eventually be
allowed considerable flexibility with respect to the terms (at
least), if not the language, which he uses to communicate with
the computer and its stored information, without imposing
additional processing load on the existing computer complex.
This implies large dictionary storage and language processing
capabilities. For this reason the basic technique and equipment
employed in the USAF AN/GSQ-16 language processing system were
selected as a key functional element of the equipment complex
needed to achieve the goals of this program. This equipment
(referred to as the Photostore) currently has a storage capacity
of 60 million bits of information. Item retrieval times on the
order of 30 milliseconds are now achieved. The basic technique
is potentially capable of greatly expanded capacity and access
times on the order of 10 milliseconds. Additionally, this
equipment has operationally demonstrated a useful capability to
translate between languages, a desirable characteristic if
successfully exploited for solution of man-machine communication
problems.

The  AN/GYA-IBM  7090  Relationship: 

Two basic methods of connecting the Photostore to an IBM 7090 as
an auxiliary program segment store (as well as a possible
lexical processor) were examined. In each approach, hardware
methods of program relocation and memory protection were assumed
to be achievable, even though they were not included in the
connection method study. Because the Photostore is an
asynchronous and independently operable device, some data buffer
storage was required in the path of interconnection between the
Photostore and the IBM 7090. A maximum program segment length of
one thousand words was assumed, and a minimum buffer size of two
thousand (36 bit) computer words was assumed (on the basis of
providing an alternate program segment choice to the IBM 7090).
Further consideration of possible uses of the Photostore-buffer
combination in conditional table look-up processing of data (the
role this combination serves in the AN/GSQ-16) may make a larger
buffer desirable, at least for experimentation purposes.

The two methods of interconnection which were examined for the
foregoing purposes were (1) Direct Data Channel Connection and
(2) Shared Memory Connection. The first of these was finally
considered the better, although each method of connection is
briefly discussed below.

Direct Data Channel Connection - This method of connection was
discussed in Volume III of the second quarterly report on
AF 19(626)-10. In review, an IBM 7607 Model II Data Channel
would be modified to transmit to the IBM 7090 a maximum of one
36 bit word every 6.54 microseconds (based upon no other I/O
channel activity). With three I/O channels on a complex using
I/O cycle times on a maximum basis, the slowest rate of transfer
of data retrieved from the Photostore would be about one 36 bit
word every 54 microseconds. A "nominal" transfer rate of 1.2
bits per microsecond (one word every 14 machine cycles) can be
assumed. This modified data channel connection is referred to as
the Direct Data Channel Connection. Of considerable
significance, with respect to this connection, is the fact that
in transferring a block of data or program words from the
Photostore to the high speed working memory, only one machine
cycle per 36 bit word is required (even though this may occur
only every 14 machine cycles).
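
The channel figures above are mutually consistent if the machine
cycle is taken as 2.18 microseconds, so that 6.54 microseconds
is three cycles. That cycle time is an assumption made for this
sketch, since the text does not state it; the names below are
likewise illustrative.

```python
# Sketch relating the quoted channel rates to machine cycles.
# MACHINE_CYCLE_US is an assumption (6.54 us / 3 cycles), not a figure
# stated in the report.

MACHINE_CYCLE_US = 2.18   # microseconds per machine cycle (assumed)
WORD_BITS = 36


def word_period_us(cycles_per_word):
    """Time between successive word transfers, in microseconds."""
    return cycles_per_word * MACHINE_CYCLE_US


def transfer_rate_bits_per_us(cycles_per_word):
    """Effective channel rate in bits per microsecond."""
    return WORD_BITS / word_period_us(cycles_per_word)


def segment_transfer_ms(words, cycles_per_word):
    """Time to move a program segment of the given length, in ms."""
    return words * word_period_us(cycles_per_word) / 1000.0


if __name__ == "__main__":
    print(word_period_us(3))                  # fastest case, one word per 3 cycles
    print(transfer_rate_bits_per_us(14))      # "nominal" rate
    print(segment_transfer_ms(256, 14))       # a 256 word segment
```

At one word every 14 cycles the rate is just under 1.2 bits per
microsecond, and a 256 word segment takes a little under 8
milliseconds, which matches the buffer-to-core time assumed in
the preceding section.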

Shared Memory Connection - Engineering design has been
accomplished that allows connection of an additional 7302 core
memory to an IBM 7090, effectively providing a total of 64
thousand words (36 bit) of high speed core memory for the IBM
7090. With this added memory attachment, the necessary hardware
modifications to the IBM 7090 are provided such that, at the
choice of the operator (or the program), the computer can
operate using either, or both, of the memories. If one now
considers the sentence analyzer memory of the AN/GSQ-16 Lexical
Processor (a smaller core memory having electrical
characteristics similar to those of the IBM 7090 core memory) as
being used as this additional memory, then the IBM 7090 and the
Photostore would communicate via the placement of data look-up
addresses (entries) or retrieved data in this common additional
memory.

The two methods of connection briefly outlined above each have
particular advantages and disadvantages, particularly when
examined from the point of view of each of the various
intelligence data processing requirements such as lexical
processing, information retrieval, and "on demand" service.
Since the latter objective is a
primary  goal  of  this  R  &  D  program  the  following  factor  strongly 
influenced  the  final  decision  to  exploit  the  Direct  Data  Channel 
connection.  In  the  Shared  Memory  Connection,  operation  on  program 
words  retrieved  from  the  Photostore  would  require  in  many  cases  the 
transfer  of  program  instruction  words  from  the  shared  memory  into 
the  main  core  memory  of  the  IBM  7090.  This  operation  would  require 
the  employment  of  a  minimum  of  six  machine  cycles  and  possibly  up 
to  IE  machine  cycles  for  each  instruction  word  transferred.  This 
appeared  to  be  a  serious  "overhead"  to  pay  in  terms  of  useful  machine 
cycles.  In  addition  to  the  foregoing,  the  future  utility  of  a  connecting 
design  which  involved  a  "non-standard"  and  technically  intimate 
connection  between  two  specific  memory  structures  was  considered 
suspect,  whereas  the  Direct  Data  Channel  Connection  offered  greater 
flexibility  of  "tie-in"  with  variations  of  structure  in  the  IBM  7000 
series  computers  (such  as  the  higher  speed  IBM  7094  Data  Processing 
System  which  is  compatible  with  the  IBM  7090  Data  Processing  System). 




To implement the basic equipment group (the AN/GYA-( )) to
reach the goals of this program, a more flexible method of
communicating between the computer complex and the "on demand"
user was required. Fortunately, the U.S. Air Force has
previously sponsored development of a Communications Control
Console capable of being connected to an IBM 7090 tape unit
connection. Although the eventual optimum characteristics of
such a console for "on demand" use in a computer complex are not
presently known, this device, which has a typewriter keyboard,
selected operation keys, and a cathode ray tube display, should
provide a good experimental capability.

The basic configuration of the AN/GYA-( ) Computer Storage
Integration Group, which will be developed and integrated with
an IBM 7090, is shown in Figure 1.

Appendix B


Optimum  Search  Strategies 


Introduction 

We consider a store consisting of N cells, with information
stored in tabular form. That is, the record r(i) stored in cell
i is in the form of an argument-function pair $(x_i, f(x_i))$,
the file being arranged in ascending order of the argument x.
Given a particular argument x, one proceeds in searching for the
cell containing $(x, f(x))$ by comparing x against the arguments
in a sequence of cells $i_1, i_2, \ldots$ This sequence of cells
is to be chosen so that the average number of comparisons needed
to locate the correct cell is a minimum.


Assumptions 

(1) We assume that for a comparison of x against $x_i$, one
    of the following three outcomes is possible:

        $x > x_i$,   $x < x_i$,   $x = x_i$.

(2) Let $\xi$ be a random variable denoting the location of
    x. We assume the a priori distribution $p_k$ is given,
    where:

        $p_k = \mathrm{Prob}\{\xi = k\}$,   $\sum_{k=1}^{N} p_k = 1$          (1)

(3) Let S be the set of integers 1 through N, and let $\sigma$
    be any non-empty subset of S. We assume that the
    a posteriori probability distribution of $\xi$ is unchanged
    except for normalization, i.e.,

        $\mathrm{Prob}[\xi = k \mid \xi \in \sigma] = p_k / P(\sigma)$,   $k \in \sigma$   (2)

    where

        $P(\sigma) = \sum_{i \in \sigma} p_i$                                   (3)

Formulation 

Let $T[(p_k), N]$ formally denote the minimum average number of
comparisons necessary, given the distribution $(p_k)$ and the
number of cells N. It is to be emphasized that T is not a
function of $p_k$ per se, but is rather a functional of the
distribution, i.e., the whole set of $p_k$'s. It is clear that
the first step is to select a cell for the first comparison.
Suppose we select cell n; then the following situation results
from the comparison:

(a) There is a probability $p_n$ that $x = x_n$ and the search
    terminates.

(b) There is a probability $P_{n-1} = \sum_{i=1}^{n-1} p_i$
    that $x < x_n$; then x must be contained somewhere in the
    first n-1 cells. Now if we renumber the cells backwards
    starting with n-1 as the first cell, the new distribution
    becomes:

        $p'_k = p_{n-k} / P_{n-1}$,   $k = 1, \ldots, n-1$          (4)

(c) There is a probability $1 - P_n = \sum_{i=n+1}^{N} p_i$
    that $x > x_n$, and upon renumbering the new distribution
    becomes:

        $p''_k = p_{n+k} / (1 - P_n)$,   $k = 1, \ldots, N-n$       (5)

Now, whatever the choice of the first comparison is, succeeding
choices must remain optimum for the overall choice to be
optimum. Therefore, T must satisfy the following minimization
functional equation:

    $T[(p_k), N] = \min_{1 \le n \le N} \{ 1 + P_{n-1} T[(p'_k), n-1]
                   + (1 - P_n) T[(p''_k), N-n] \}$                 (6)

Equation (6) is a dynamic programming equation [1] yielding as
solutions the functional $T(\cdot, \cdot)$ and the policy
$n[(p_k), N]$ which minimizes the right hand side. As initial
conditions we set $P_0 T(\cdot, 0) = 0$ and $T(\cdot, 1) = 0$.

The following distributions are of particular interest:

(a) $p_k = 1/N$,   $k = 1, \ldots, N$

(b) $p_k = \frac{N(1-\rho) - (1+\rho) + 2\rho k}{N(N-1)}$,   $-1 \le \rho \le 1$

(c) $p_k = \binom{N-1}{k-1} \gamma^{k-1} (1-\gamma)^{N-k}$,   $0 < \gamma < 1$

(d) $p_k$ = [Yule distribution; the expression is not
    recoverable from the source copy]

The first is obviously the uniform distribution. The second is
the distribution with constant slope. The third is the binomial
distribution and the fourth the Yule distribution. Thus far,
Eq. (6) is completely solved only for (a).

Solution for Uniform Distribution


For $p_k = 1/N$, Equation (6) reduces to

    $T(N) = 1 + \min_{1 \le n \le N} \{ \frac{n-1}{N} T(n-1)
            + (1 - \frac{n}{N}) T(N-n) \}$                         (7)

If we let f(N) = N T(N), then Equation (7) becomes:

    $f(N) = N + \min_{1 \le n \le N} [ f(n-1) + f(N-n) ]$          (8)

It can be shown that the solutions are as follows (see
Appendix I):

    $f(2^{k+1} + 2m - 1) = 2^{k+1} (k - \tfrac{1}{2}) + 2mk + 3m + 1$,
        $k = 0, 1, \ldots$;   $m = 0, 1, \ldots, 2^k$              (9)

    $f(2^{k+1} + 2m) = 2^{k+1} (k - \tfrac{1}{2}) + (2m+1)k + 3(m+1)$,
        $k = 0, 1, \ldots$;   $m = 0, 1, \ldots, 2^k - 1$          (10)

The policy solution n(N) which yields the minimum is not unique.
In fact, the multiplicity of solutions for large N is quite
large. The complete set of solutions is as follows:

    $n(2^{k+1} + 2m) = 2^k + j$,
        $j = 0, 1, \ldots, 2m+1$,               $m < 2^{k-1}$
        $j = 2m - 2^k + 1, \ldots, 2^k$,        $m \ge 2^{k-1}$    (11)

    $n(2^{k+1} + 2m - 1) = 2^k + 2j$,
        $j = 0, 1, \ldots, m$,                  $m \le 2^{k-1}$
        $j = m - 2^{k-1}, \ldots, 2^{k-1}$,     $m \ge 2^{k-1}$    (12)

For example, consider $N = 2^4 + 6 = 22$:

    n(22) = 8, 9, 10, 11, ..., 15.

Similarly, n(23) is given by

    n(23) = 8, 10, 12, 14, 16.

Linear Distribution


For $p_k = \frac{1}{N(N-1)} [ N(1-\rho) - (1+\rho) + 2\rho k ]$,
Eq. (6) becomes

    $T(\rho, N) = 1 + \min_{1 \le n \le N} \{ P_{n-1}(\rho, N) T[f_{n-1}(\rho, N), n-1]
                  + P_{N-n}(-\rho, N) T[f_{N-n}(-\rho, N), N-n] \}$   (13)

where

    $P_n(\rho, N) = \frac{n}{N} [ 1 - \rho \frac{N-n}{N-1} ]$      (14)

and

    $f_n(\rho, N) = \frac{\rho (n-1)}{(N-1) - \rho (N-n)}$         (15)

It can be shown that $T(\rho, N)$ as a function of $\rho$ is
piecewise linear, i.e.,

    $T(\rho, N) = a(\rho, N) - b(\rho, N) |\rho|$,                 (16)

where $a(\rho, N)$ and $b(\rho, N)$ are constants over intervals
of $\rho$. It is also clear that:

    $n(\rho, N) \ge \max \{ n(0, N) \}$,   $\rho > 0$,             (17)

and similarly, since $n(\rho, N) = N + 1 - n(-\rho, N)$,

    $n(\rho, N) \le \min \{ n(0, N) \}$,   $\rho < 0$,             (18)

where maximization and minimization are taken over the set of
multiple solutions for $\rho = 0$ (uniform distribution). From
the earlier result we find that:

    $\max \{ n(2^{k+1} + 2m) \} = 2^k + 2m + 1$,   $m < 2^{k-1}$
                               $= 2^{k+1}$,        $m \ge 2^{k-1}$   (19)

    $\max \{ n(2^{k+1} + 2m - 1) \} = 2^k + 2m$,   $m \le 2^{k-1}$
                                   $= 2^{k+1}$,    $m \ge 2^{k-1}$   (20)
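
The uniform-case results can be checked numerically against the
recursion of Eq. (8). The sketch below is an illustration in
Python, not part of the original analysis; it assumes the
initial conditions f(0) = f(1) = 0 and the closed forms (9),
(10), (19) and (20) as they are read here, and all function
names are hypothetical.

```python
# Tabulate f(N) = N*T(N) from Eq. (8) and compare with the closed forms.

def f_table(max_n):
    """f(N) from the recursion in Eq. (8), with f(0) = f(1) = 0."""
    f = [0, 0]
    for N in range(2, max_n + 1):
        f.append(N + min(f[n - 1] + f[N - n] for n in range(1, N + 1)))
    return f


def policy(N, f):
    """Every first-comparison cell n attaining the minimum (N >= 2)."""
    best = f[N] - N
    return [n for n in range(1, N + 1) if f[n - 1] + f[N - n] == best]


def split(N):
    """Write N as 2**(k+1) + 2m - 1 (odd N) or 2**(k+1) + 2m (even N)."""
    base = N + 1 if N % 2 else N
    k = base.bit_length() - 2
    m = (base - 2 ** (k + 1)) // 2
    return k, m


def f_closed(N):
    """Closed forms (9) for odd N and (10) for even N."""
    k, m = split(N)
    lead = 2 ** (k + 1) * k - 2 ** k          # 2^(k+1) * (k - 1/2)
    if N % 2:
        return lead + 2 * m * k + 3 * m + 1
    return lead + (2 * m + 1) * k + 3 * (m + 1)


def max_policy_closed(N):
    """Largest optimal first cell, Eq. (20) for odd N, Eq. (19) for even N."""
    k, m = split(N)
    if N % 2:
        return 2 ** k + 2 * m if 2 * m <= 2 ** k else 2 ** (k + 1)
    return 2 ** k + 2 * m + 1 if 2 * m < 2 ** k else 2 ** (k + 1)


if __name__ == "__main__":
    f = f_table(64)
    print(policy(22, f))   # optimal first cells for N = 22
    print(policy(23, f))   # optimal first cells for N = 23
    print(all(f[N] == f_closed(N) for N in range(1, 65)))
```

A direct tabulation of this kind costs on the order of N squared
operations, which is negligible for the file sizes used in the
examples and gives an independent check on any closed form.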


Now, instead of seeking the optimum solution to Eq. (6), we can
obtain a near-optimum (for small $\rho$) solution by using
Eqs. (19) and (20) in Eq. (6). It can be shown that the
near-optimum solution $T_0(\rho, N)$ is of the form

    $T_0(\rho, N) = \frac{1}{N} f(N) - \frac{|\rho|}{N(N-1)} g(N)$,   (21)

where f(N) is given by Eqs. (9) and (10) and g(N) satisfies the
functional equation:

    $g(N) = (N-n+1) f(n-1) - n f(N-n) + g(n-1) + g(N-n)$.          (22)

In Eq. (22), n is given by Eqs. (19) and (20). The function g(N)
for some specific values of N can easily be obtained. For
example,

    $g(2^{k+1} - 1) = 2^k f(2^k - 1) - 2^k f(2^k - 1) + 2 g(2^k - 1)$,   (23)

whence $g(2^{k+1} - 1) = 0$, since g(1) = 0. However, a general
solution for g(N) has not yet been obtained.

Entropy Measure

It is often argued (cf. [2]) on intuitive grounds that the
entropy measure $-\sum_{k=1}^{N} p_k \log_2 p_k$ represents
approximately the minimum average number of comparisons. The
validity of this approximation can be examined using Eq. (6). To
do this, we change assumption (1) slightly, and restrict the
outcomes of a comparison to $x \le x_i$ and $x > x_i$, thus
eliminating the possible outcome $x = x_i$. As a result, Eq. (6)
is modified and becomes:

    $T[(p_k), N] = \min_{1 \le n \le N} \{ 1 + P_n T[(p'_k), n]
                   + (1 - P_n) T[(p''_k), N-n] \}$                 (24)

where $(p'_k)$ and $(p''_k)$ are now the normalized
distributions over the first n and the last N-n cells,
respectively. Now, if we substitute for $T[\cdot, \cdot]$ in
Eq. (24)

    $T[(p_k), N] = -\sum_{k=1}^{N} p_k \log_2 p_k$,                (25)

Eq. (24) becomes:

    $-\sum_{k=1}^{N} p_k \log_2 p_k = \min_{1 \le n \le N} \{ 1
        - \sum_{k=1}^{n} p_k \log_2 p_k + P_n \log_2 P_n
        - \sum_{k=n+1}^{N} p_k \log_2 p_k
        + (1 - P_n) \log_2 (1 - P_n) \}$                           (26)

or

    $\min_{1 \le n \le N} \{ 1 + P_n \log_2 P_n
        + (1 - P_n) \log_2 (1 - P_n) \} = 0$                       (27)

Equation (27) is satisfied if and only if

    $P_n = 1/2$                                                    (28)

In general, there is no integer n which satisfies Eq. (28).
Furthermore, even if there is an integer n which satisfies
Eq. (28), in general Eq. (25) is still not a solution, due to
the recursive nature of Eq. (24). In order for Eq. (25) to be
the optimum solution, not only must there be an n such that

    $\sum_{k=1}^{n} p_k = 1/2$,

but there must also be integers $\ell$ and m such that

    $\sum_{k=1}^{\ell} p_k = 1/4$,   $\sum_{k=n+1}^{m} p_k = 1/4$,

and so on. For large N and under some regularity conditions on
$p_k$, it is very likely that for the first few comparisons
partitioning into equally probable subsets can be accomplished.
However, it is not clear in what precise sense Eq. (25) can be
regarded as an approximate solution.

Discussion and Conclusion


From a purely theoretical point of view, the formulation of the
search problem as a functional equation involving minimization
is quite attractive. First, it has led to the complete solution
in one simple case and a detailed examination of an approximate
solution in a second case. Secondly, this approach has permitted
a precise analysis of the entropy measure as an approximate
solution for the minimum average number of comparisons.
Nevertheless, from the point of view of application, the purely
theoretical approach has serious limitations. First, the
formulation requires that the a priori distribution be known.
An accurate knowledge of the distribution is probably difficult
to obtain in most practical situations. However, it should be
noted that this difficulty is an inherent one and not due to the
particular formulation chosen. It is always possible to devise a
strategy which is distribution free, such as binary search or
serial search, but this almost always means that our information
concerning what has been stored is not fully utilized. A second
limitation is that implementation of an optimum strategy would
almost surely involve some computation time. If the computation
time becomes significant in comparison with the average search
time, then it is doubtful that the total time required will be a
minimum.

Despite the above reservations, the importance of a rigorous
theoretical approach should not be underestimated. Its principal
usefulness is in establishing a standard of performance against
which any practical strategy of search can be measured. In this
regard, the most important open problem is to establish both
upper and lower bounds for the minimum mean search time (or
minimum number of comparisons). It has been shown that the
entropy measure is in some sense an approximate solution.
However, in the general case where the outcome of each
comparison is "greater than", "equal to" or "smaller than", the
entropy measure is neither an upper nor a lower bound.

Future  Research 

As the design of the system for the integrated complex
progresses, a number of immediate storage and searching problems
have arisen or can be anticipated. A few of these are briefly
formulated below.

A. The storage apportionment problem - it frequently occurs
that a large amount of information has to be stored, some of which
is to be retrieved more frequently than the rest. The question then
arises as to what proportion of the information should be stored in
a rapid-access, high-cost memory (e.g., core) and what fraction
should be stored in a slow-access, low-cost memory (e.g., tape)
to achieve a favorable balance of cost and speed. Although the general
problem is open, a great deal can be inferred from existing
results. The following example will serve to illustrate the approach
that can be used. Suppose that a dictionary containing N entries
is to be stored in a combination of core storage and tape. Assume
that the entries are arranged in decreasing order of frequency of
use such that the $j$th entry has a frequency of use proportional to
$1/j$ (Yule Distribution or Zipf's Law). The problem is how to determine the
average time required per look-up and the total cost of storage, if
n of the N entries are stored in core and the remainder on tape.
Under some simplifying assumptions, this problem can be solved.
Assume that the n entries in core are sorted so that a binary search
can be used. It is estimated that for each comparison cycle of a
binary search, approximately 15 machine cycles or 30 $\mu$sec of 7090
time is required. Assume that the information on tape is again
sorted with a record length of k entries. This immediately implies
that an additional core storage for k entries is required. It is assumed
that entries are read off the tape in blocks of k, and a binary search
is then used on the k entries. Each entry is assumed to be 5 machine
words long (180 bits). The approximate average time per
lookup is easily calculated to be:

    $T \cong 0.03 \log_2 n \, \frac{\sum_{j=1}^{n} 1/j}{\sum_{j=1}^{N} 1/j} + (7.3 + 0.5k + 0.03 \log_2 k) \, \frac{\sum_{j=n+1}^{N} 1/j}{\sum_{j=1}^{N} 1/j}$

in milliseconds, where a tape reading time of (7.3 + 0.1 x number of
wds/rec) milliseconds has been assumed. If we use the approximation

    $\sum_{j=1}^{n} 1/j \cong \ln n + \gamma$    ($\gamma \cong 0.58$, Euler's constant),

then the expression for T simplifies to

    $T \cong 0.03 \log_2 n \, \frac{\ln n + \gamma}{\ln N + \gamma} + (7.3 + 0.5k + 0.03 \log_2 k) \, \frac{\ln N - \ln n}{\ln N + \gamma}$

The total cost of storage is approximately

    $C = p_1 (n + k) + p_2 (N - n)$,

where $p_1$ and $p_2$ are the costs per 180 bits of core storage and tape
storage respectively. To reduce T, one can increase either n or k
or both, of course, at a greater cost. Since T is nonlinear with
respect to both n and k, while C is linear, speed is not simply
proportional to cost. The choice of n and k must be carefully considered.
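
The trade-off between T and C can be explored numerically. The following is a minimal sketch, not part of the original study: the function names `harmonic`, `lookup_time_ms` and `storage_cost` are hypothetical, and the constants (0.03 msec per comparison cycle, 7.3 + 0.5k msec per tape record) are simply those assumed in the example above. It evaluates T both with exact harmonic sums and with the logarithmic approximation.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant; the text rounds it to 0.58


def harmonic(n):
    """Partial sum 1 + 1/2 + ... + 1/n (0 when n = 0)."""
    return sum(1.0 / j for j in range(1, n + 1))


def lookup_time_ms(N, n, k, exact=True):
    """Average look-up time T in milliseconds, with n of the N entries
    in core and the remainder on tape in records of k entries."""
    core_ms = 0.03 * math.log2(n)                  # binary search in core
    tape_ms = 7.3 + 0.5 * k + 0.03 * math.log2(k)  # read one record, search it
    if exact:
        h_n, h_N = harmonic(n), harmonic(N)
        p_core, p_tape = h_n / h_N, (h_N - h_n) / h_N
    else:
        denom = math.log(N) + EULER_GAMMA
        p_core = (math.log(n) + EULER_GAMMA) / denom
        p_tape = (math.log(N) - math.log(n)) / denom
    return core_ms * p_core + tape_ms * p_tape


def storage_cost(N, n, k, p1, p2):
    """Total cost C = p1 (n + k) + p2 (N - n), in the units of p1 and p2."""
    return p1 * (n + k) + p2 * (N - n)
```

Tabulating `lookup_time_ms` against `storage_cost` over a grid of n and k, for given unit costs, is one way to pick a favorable operating point under these assumptions.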

As this example indicates, a large number of problems in the area
of apportionment and selection of storage are amenable to theoretical
analysis and should be pursued.
B. A second area where work is required is the case in which the cost
of maintaining a sorted file is too high. However, this does not
mean that completely random storage has to be resorted to. Between
a sorted file (where order is almost complete) and a completely random
file, there is a wide spectrum of policies of filing and searching
which preserve most of the structure and order inherent in the
information to be stored at a fraction of the cost of maintaining a
sorted file. The open addressing system described in detail by
Peterson (Ref. 2) is an example of such a policy. Another example
is where items to be stored can be labeled by an originating date
(e.g., news items, periodicals). A simple policy would be to store
the items in the order of accession date. A knowledge of the
distribution of the gap between the originating date and the accession date
can be used in designing search strategies which permit rapid access
to the file.
Optimum Solutions for Uniform Distribution


In  this  appendix,  we  shall  prove  that  the  solutions  of 
Eq.  (8)  are  given  by  Eqs.  (9)  through  (12).  The  proof  proceeds  in 
three  stages.  First,  we  shall  show  that  the  right  hand  side  of  Eq.  (8) 
is  minimized  by  a  specific  set  of  values  of  n.  Next,  the  solution 
for  f(N)  will  be  given.  Finally,  the  multiplicity  of  the  policy  solution 
n(N)  will  be  derived. 

A. We begin by proving the following theorem:

Theorem 1: Given the conditions f(0) = f(1) = 0, the equation

    $f(N) = N + \min_{1 \le n \le N} [f(n-1) + f(N-n)]$,    (8)

is satisfied if n = n*(N) where

    $n^*(4m-2) = n^*(4m-1) = n^*(4m) = n^*(4m+1) = 2m$,  m = 1, 2, ...    (A.1)

Proof: It is easily seen that the theorem is equivalent to the following
set of equations for m ranging over all positive integers:

    $f(4m-2) = 4m - 2 + f(2m-2) + f(2m-1)$,    (A.2a)

    $f(4m-1) = 4m - 1 + f(2m-1) + f(2m-1)$,    (A.2b)

    $f(4m) = 4m + f(2m-1) + f(2m)$,    (A.2c)

    $f(4m+1) = 4m + 1 + f(2m-1) + f(2m+1)$.    (A.2d)

The proof proceeds by induction. First, it is easily shown by enumerating
all possibilities in Eq. (8) that Eq. (A.2) holds for m = 1.
Now assume Eq. (A.2) to hold for m = 1, 2, ..., K; it will be shown
that the validity of Eq. (A.2) for m = K + 1 follows. The main part of
the proof of the theorem will need the following Lemma.

Lemma: The validity of (A.2) for m = 1, 2, ..., K implies the following
inequalities:

    $f(n+1) - f(n) \ge f(n-1) - f(n-2)$,  n = 2, 3, ..., 4K,    (A.3a)

    $f(n+1) > f(n)$,  n = 1, 2, ..., 4K,    (A.3b)

    $f(2n) - f(2n-1) > f(2n+1) - f(2n)$,  n = 1, 2, ..., 2K.    (A.3c)

If we subtract (A.2c) from (A.2d) and (A.2a) from (A.2b), we see
that

    $f(4m+1) - f(4m) \ge f(4m-1) - f(4m-2) \ge f(4m-3) - f(4m-4) \ge \cdots \ge 0$,  m = 1, ..., K.    (A.4)

Similarly,

    $f(4m) - f(4m-1) \ge f(4m-2) - f(4m-3) = f(4m-4) - f(4m-5) \ge f(4m-6) - f(4m-7) \ge \cdots > 0$,  m = 2, ..., K.    (A.5)

Thus, Eqs. (A.3a) and (A.3b) follow immediately from Eqs. (A.4) and
(A.5). Equation (A.3c) can be proved by induction. If Eq. (A.2) is
valid for m = 1, ..., K, then

    $f(2m) - f(2m-1) > f(2m+1) - f(2m)$,  m = 1, 2, ..., K

implies

    $f(4m-2) - f(4m-3) > f(4m-1) - f(4m-2)$

and

    $f(4m) - f(4m-1) > f(4m+1) - f(4m)$,  m = 1, 2, ..., K.

Therefore, the fact that f(2) - f(1) > f(3) - f(2) implies Eq. (A.3c).
Now f(4K + 2) can be written by Eq. (8) as

    $f(4K+2) = 4K + 2 + \min_{1 \le n \le 4K+2} [f(n-1) + f(4K+2-n)]$.    (A.6)

By symmetry and Eq. (A.3b), Eq. (A.6) can be rewritten

    $f(4K+2) = 4K + 2 + \min_{2 \le n \le 4K+1} [f(n-1) + f(4K+2-n)]$
    $\phantom{f(4K+2)} = 4K + 2 + \min_n \{ [f(2n-1) + f(4K+2-2n)], [f(2n) + f(4K+1-2n)] \}$.    (A.7)

Now by Eq. (A.3a), Eq. (A.7) becomes

    $f(4K+2) = 4K + 2 + \min \{ [f(2K+2) + f(2K-1)], [f(2K) + f(2K+1)] \}$
    $\phantom{f(4K+2)} = 4K + 2 + f(2K) + f(2K+1)$.    (A.8)

Now, Eq. (A.2a) becomes valid for m = 1, 2, ..., K + 1, and Eq.
(A.3a) is valid for n = 1, ..., 4K + 1.

Similarly, by the use of Eqs. (A.3a) and (A.3b),
f(4K + 3) can be written

    $f(4K+3) = 4K + 3 + \min \{ 2f(2K+1), f(2K) + f(2K+2) \}$.    (A.9)

Now it follows from (A.3c) that

    $f(2K+2) - f(2K+1) > f(2K+3) - f(2K+2)$,

and it follows from Eq. (A.3a) that

    $f(2K+3) - f(2K+2) \ge f(2K+1) - f(2K)$.

Therefore,

    $f(2K+2) + f(2K) > 2f(2K+1)$,

and

    $f(4K+3) = 4K + 3 + 2f(2K+1)$,    (A.10)

thus extending Eq. (A.2b) to m = K + 1.

By following a procedure nearly identical to the above,
it can be shown that

    $f(4K+4) = 4K + 4 + f(2K+1) + f(2K+2)$,    (A.11)

and

    $f(4K+5) = 4K + 5 + f(2K+1) + f(2K+3)$.    (A.12)

By induction, Eq. (A.2) is valid for all integral m, thus proving
the theorem.
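
Theorem 1 can be checked numerically for small N by evaluating the recursion directly. The sketch below is an illustration only; the function name `solve_eq8` is hypothetical, and Eq. (8) is taken in the form $f(N) = N + \min_{1 \le n \le N}[f(n-1) + f(N-n)]$ with f(0) = f(1) = 0. It records every n that attains the minimum, so that the claim n*(N) = 2m can be tested.

```python
def solve_eq8(n_max):
    """Tabulate f(0..n_max) from Eq. (8) and, for each N >= 2, the list
    of all n that attain the minimum on the right-hand side."""
    f = [0, 0]            # boundary conditions f(0) = f(1) = 0
    minimizers = [[], []]
    for N in range(2, n_max + 1):
        costs = {n: f[n - 1] + f[N - n] for n in range(1, N + 1)}
        best = min(costs.values())
        f.append(N + best)
        minimizers.append([n for n, c in costs.items() if c == best])
    return f, minimizers


f, minimizers = solve_eq8(121)
# Theorem 1: n = 2m is a minimizing choice for N = 4m-2, 4m-1, 4m, 4m+1.
theorem1_holds = all(
    2 * m in minimizers[N]
    for m in range(1, 31)
    for N in (4 * m - 2, 4 * m - 1, 4 * m, 4 * m + 1)
)
```

The table of minimizers also shows that n = 2m is in general not the only minimizing choice, which is the subject of Theorem 3.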

B. Thus far, it has only been shown that certain choices of n yield
the minimum for the right hand side of Eq. (8). Nothing has been
said about what form f(N) might take. This will be taken care of
by the following theorem:

Theorem 2: Equation (8) is satisfied if and only if

    $f(2^{k+1} + 2m - 1) = 2^{k+1}(k - \tfrac{1}{2}) + 2mk + 3m + 1$,  k = 0, 1, ...,  m = 0, 1, ..., $2^k$,    (B.1a)

    $f(2^{k+1} + 2m) = 2^{k+1}(k - \tfrac{1}{2}) + (2m+1)k + 3m + 3$,  k = 0, 1, ...,  m = 0, 1, ..., $2^k - 1$.    (B.1b)

Proof: The "only if" (or uniqueness) part of Theorem 2 is trivial and
will be taken care of first. Suppose there are two solutions $f_a(N)$
and $f_b(N)$. If $f_b > f_a$, then $f_b$ cannot be a solution of Eq. (8).
Similarly, $f_a$ cannot be a solution if $f_a > f_b$. Thus, the solution f(N)
of Eq. (8) must be unique.

To prove Eq. (B.1), we can again employ induction.
That is, we can easily verify that Eq. (B.1) is valid for k = 0; in
addition, we assume Eq. (B.1) to be valid for k = 0, 1, ..., K - 1.
If, thereby, the validity of Eq. (B.1) for k = K follows, then Eq.
(B.1) must be valid for all k. The complete proof is a simple exercise
in substituting Eq. (B.1) in Eq. (A.2) with elementary manipulations,
and will not be carried out here.
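
Since the substitution is not carried out here, a numerical comparison offers a partial check. The sketch below is an illustration under stated assumptions: the names `f_by_recursion` and `f_closed_form` are hypothetical, and Eq. (B.1) is read with the factor $(k - \tfrac{1}{2})$, so that $2^{k+1}(k - \tfrac{1}{2}) = 2^k(2k - 1)$ is an integer.

```python
def f_by_recursion(n_max):
    """f(0..n_max) computed directly from Eq. (8), with f(0) = f(1) = 0."""
    f = [0, 0]
    for N in range(2, n_max + 1):
        f.append(N + min(f[n - 1] + f[N - n] for n in range(1, N + 1)))
    return f


def f_closed_form(N):
    """f(N) for N >= 1 from Eq. (B.1a) (odd N) or Eq. (B.1b) (even N)."""
    k = max(N.bit_length() - 2, 0)      # 2^(k+1) <= N < 2^(k+2) for N >= 2
    base = (2 ** k) * (2 * k - 1)       # equals 2^(k+1) (k - 1/2)
    if N % 2 == 1:
        m = (N + 1 - 2 ** (k + 1)) // 2
        return base + 2 * m * k + 3 * m + 1
    m = (N - 2 ** (k + 1)) // 2
    return base + (2 * m + 1) * k + 3 * m + 3


table = f_by_recursion(512)
agrees = all(f_closed_form(N) == table[N] for N in range(1, 513))
```

Agreement over a finite range is of course no substitute for the induction argument; it only guards against transcription errors in the formulas.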

C. Theorem 1 is strengthened by the following theorem:

Theorem 3: Equation (8) is satisfied by n = n*(N), i.e.,

    $f(N) = N + f(n^* - 1) + f(N - n^*)$,    (C.1)

if and only if

    $n^*(2^{k+1} + 2m) = 2^k + j$,  j = 0, 1, ..., 2m + 1,  $0 \le m < 2^{k-1}$,    (C.2a)
                                   j = $2m - 2^k + 1$, ..., $2^k$,  $2^{k-1} \le m \le 2^k - 1$,

    $n^*(2^{k+1} + 2m - 1) = 2^k + 2j$,  j = 0, 1, ..., m,  $0 \le m \le 2^{k-1}$,    (C.2b)
                                        j = $m - 2^{k-1}$, ..., $2^{k-1}$,  $2^{k-1} \le m \le 2^k$.

Proof: The "if" part of the theorem is easily proved by simply
substituting Eqs. (C.2) and (B.1) in (C.1) and verifying. In the process,
it is also easily shown that for $2^{k+1} \le N \le 2^{k+2} - 1$, the only
solutions in the range $2^k \le n^* \le 2^{k+1}$ are those given by
Eq. (C.2). Therefore, it only remains to show that no value of n*
greater than $2^{k+1}$ or less than $2^k$ satisfies Eq. (C.1).

Consider $N = 2^{k+1} + 2m$, $0 \le m < 2^{k-1}$. Since we
know that $n^* = 2^k$ is a solution, we need only to prove that (similar
results follow for $n^* > 2^{k+1}$ by symmetry)

    $f(2^k - 1) + f(2^k + 2m) \le f(2^k - 2) + f(2^k + 2m + 1) \le f(2^k - 3) + f(2^k + 2m + 2) \le \cdots$    (C.3)

Inequality (C.3) follows immediately from (A.4)§. The proof for
$N = 2^{k+1} + 2m$, $2^{k-1} \le m < 2^k$, is equally simple. By using $n^* = 2^{k+1}$,
it is found that

    $f(2^{k+1} - 1) + f(2m) \le f(2^{k+1}) + f(2m - 1) \le f(2^{k+1} + 1) + f(2m - 2) \le \cdots$    (C.4)

which follows from (A.5). The proof for $N = 2^{k+1} + 2m - 1$ follows
completely similar lines and will not be reproduced here.


§In fact, a stronger inequality with strictly unequal signs throughout
follows from (A.4).
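
The multiplicity of the policy solution can be illustrated by listing, for each N, every n that attains the minimum in Eq. (8) and comparing the list with the sets described in Theorem 3. The sketch below is a hedged illustration: `all_minimizers` and `predicted_minimizers` are hypothetical names, and Eq. (C.2) is read as n* = 2^k + j for even N and n* = 2^k + 2j for odd N, with the ranges of j assumed as coded; the comparison is restricted to k >= 1, i.e. N >= 4.

```python
def all_minimizers(n_max):
    """For each N in 2..n_max, the sorted list of n attaining the
    minimum of f(n-1) + f(N-n) in Eq. (8), with f(0) = f(1) = 0."""
    f, result = [0, 0], {}
    for N in range(2, n_max + 1):
        costs = [f[n - 1] + f[N - n] for n in range(1, N + 1)]
        best = min(costs)
        f.append(N + best)
        result[N] = [n for n, c in enumerate(costs, start=1) if c == best]
    return result


def predicted_minimizers(N):
    """Minimizing n for N >= 4 according to the assumed reading of (C.2)."""
    k = N.bit_length() - 2                  # 2^(k+1) <= N < 2^(k+2), k >= 1
    half, full = 2 ** (k - 1), 2 ** k
    if N % 2 == 0:                          # N = 2^(k+1) + 2m, Eq. (C.2a)
        m = (N - 2 * full) // 2
        js = range(0, 2 * m + 2) if m < half else range(2 * m - full + 1, full + 1)
        return [full + j for j in js]
    m = (N + 1 - 2 * full) // 2             # N = 2^(k+1) + 2m - 1, Eq. (C.2b)
    js = range(0, m + 1) if m <= half else range(m - half, half + 1)
    return [full + 2 * j for j in js]


observed = all_minimizers(128)
matches = all(observed[N] == predicted_minimizers(N) for N in range(4, 129))
```

For example, N = 10 admits the four choices n = 4, 5, 6, 7, whereas N = 15 admits only n = 8.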


Appendix  C 


Sampling  for  Co-occurrence 


1. In business and military intelligence as well as in the retrieval of
information from documented knowledge, it is quite often necessary to
determine the number of items that co-occur in two or more lists of items. These
items may be the names of individuals, titles of documents, "key words", etc.
The lists of items, though finite, may often be too long to permit exhaustive
matching for co-occurrence without making the cost of this operation
prohibitive. Thus, it is of interest to explore the possibility of applying sampling
techniques to estimate the number of co-occurring items.

As a specific case of the problem of co-occurrence, let us suppose
that we have lists of names of individuals on the basis of their professional
specialities such as Mathematics, Engineering, Physics, etc. In addition,
let us assume that associated with each name is a short biography. It might
be of interest to obtain the number of individuals in the various lists who might
have been working together on some specific project or might have been at a
particular university during a specified time period. Quite often, it is only
necessary to obtain an estimate of this number rather than its exact value. In any
man-machine information retrieval system, it may often prove to be
advantageous to obtain a reasonable estimate of the number of co-occurrences before
a display by cathode ray tube or any other hardware, since such displays may
be very expensive and one might want to reduce the number of items to be
displayed by adding new requirements. Thus in the specific problem mentioned
previously, if the people who worked on a particular project are
too numerous, and the properties they should have in common other than
the one specified are known, then by requiring these further properties to
be satisfied, the size of the collection of co-occurring items can be reduced
until a stage is reached at which display is economically feasible. Thus at each stage, we
can obtain an estimate of the number of items and stop sampling when the
desired size of display has been obtained.

Having  given  a  rather  brief  account  of  the  reasons  for  sampling  to 
determine  co-occurrences,  we  shall  consider  the  statistical  aspects  of  the 
sampling  problem.  There  are  two  situations  with  which  we  shall  be  concerned. 
They  are: 

1.  The  finite  populations  are  unstratified  and  the  sampling  is  simple 
random  sampling. 

2.  The  finite  populations  are  stratified  and  the  sampling  is  stratified 
sampling.  Stratification  is  economical  in  such  cases  as  alphabetical 
listings. 

This problem has received some attention earlier. Leo Goodman (2)
discussed simple random sampling and derived the unbiased estimator for the
number of items common to two or more lists of items. The approximate
variance of the estimator for small sampling fractions was indicated. We shall
obtain the exact moments of the unbiased estimator. This will enable us to
discuss the skewness and excess of the distribution of the estimator. Using the
probability distribution of the estimator we can show computationally that the
unbiased estimator is indeed the maximum likelihood estimator. We shall
define an unbiased estimator for stratified sampling and indicate several interesting
aspects of this type of estimation.
As shown by Goodman, this is one situation where insufficient
statistics give minimum variance and unbiasedness. As he indicates, sometimes the
estimator has unreasonable values and adjustments have to be made so that
the nearest reasonable estimate is used. This naturally will introduce bias.

We shall discuss the sampling problem with emphasis on pairwise
co-occurrence but shall indicate natural extensions to more than a pair of
populations.

2.  Sampling  Techniques 

Assumptions: 

1) There are r populations $U_1, U_2, \ldots, U_r$ with $N_1, N_2, \ldots, N_r$
units respectively. The units do not occur more than once in
each population.

2) Samples of units will be drawn from $U_1, U_2, \ldots, U_r$ without
replacement.

3) Sampling can be done in one or more stages.

4) Sampling from the populations or from the strata is random
sampling.

A. Simple Random Sampling

Definition 1: Simple random sampling is a method of selecting n
units out of a population of N units such that every one of the $\binom{N}{n}$ samples
has an equal chance of being chosen. Sometimes this type of sampling is
referred to as unrestricted random sampling in statistical literature.
Assumptions 1) and 2) hold.
Sampling Scheme: Simple random samples $S_1, S_2, \ldots, S_r$ of sizes
$n_1, n_2, \ldots, n_r$ are drawn from populations $U_1, U_2, \ldots, U_r$. The number
of units, $n_{ij}$, which samples $S_i$ and $S_j$ have in common with each other
is observed for $i \ne j$, $1 \le i, j \le r$.

Problems: If $\sigma_{ij}$ is the number of units which $U_i$ and $U_j$ have in
common, we want to estimate $\sigma_{ij}$, using an estimator $\hat\sigma_{ij}$ which is
a real valued function of $n_{ij}$, $n_i$, $n_j$, $N_i$ and $N_j$. The estimator has
to be unbiased and should be a minimum variance unbiased estimator. Further,
the variance and higher moments of $\hat\sigma_{ij}$ have to be derived. The coefficients
of skewness and excess have to be evaluated to study their effects on the
predictive value of $\hat\sigma_{ij}$. Natural extensions to defining estimators of $\sigma_{ij \ldots l}$,
the number of units which populations $U_i, U_j, \ldots, U_l$ have in common, are
investigated.

Analysis: We shall make the observation that $\sigma_{ij}$, $\sigma_{ijk}$, etc. are
values or constants.


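Definition:

    $\hat\sigma_{ij} = \frac{N_i N_j}{n_i n_j} \, n_{ij}$    (1)

Theorem 1: $\hat\sigma_{ij}$ as defined by $\hat\sigma_{ij} = \frac{N_i N_j}{n_i n_j} n_{ij}$ is an unbiased
estimator of $\sigma_{ij}$, and in general $\hat\sigma_{ij \ldots l} = \frac{N_i N_j \cdots N_l}{n_i n_j \cdots n_l} \, n_{ij \ldots l}$ is an
unbiased estimator of $\sigma_{ij \ldots l}$.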
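Proof: We have to show that the expected value of $\hat\sigma_{ij}$, written as
$E(\hat\sigma_{ij})$, is equal to $\sigma_{ij}$.

We shall define a function $a_{\alpha(i,j)}$ as follows:

    $a_{\alpha(i,j)} = 1$ if the unit $\alpha$ appears in both $U_i$ and $U_j$,
    $a_{\alpha(i,j)} = 0$ otherwise.

In addition we define a random variable

    $a'_{\alpha(i,j)} = 1$ if the unit $\alpha$ appears in both $S_i$ and $S_j$,
    $a'_{\alpha(i,j)} = 0$ otherwise.

Obviously, $\sigma_{ij} = \sum_\alpha a_{\alpha(i,j)}$, where the $\sum_\alpha$ is over all $\alpha$ in both $U_i$ and $U_j$.
Further, $n_{ij} = \sum_\alpha a'_{\alpha(i,j)}$, where the summation is over all $\alpha$ appearing in both
$S_i$ and $S_j$. Now

    $E(n_{ij}) = \sum_\alpha E(a'_{\alpha(i,j)})$

and

    $E(a'_{\alpha(i,j)}) = \frac{\binom{N_i - 1}{n_i - 1} \binom{N_j - 1}{n_j - 1}}{\binom{N_i}{n_i} \binom{N_j}{n_j}} \, a_{\alpha(i,j)} = \frac{n_i n_j}{N_i N_j} \, a_{\alpha(i,j)}$.

Thus

    $E(n_{ij}) = \frac{n_i n_j}{N_i N_j} \sum_\alpha a_{\alpha(i,j)} = \frac{n_i n_j}{N_i N_j} \, \sigma_{ij}$,

i.e.,

    $E\left( \frac{N_i N_j}{n_i n_j} \, n_{ij} \right) = \sigma_{ij}$.

Defining the function $a_{\alpha(i,j,\ldots,l)}$ and the random variable $a'_{\alpha(i,j,\ldots,l)}$
for more than two populations, we can show that

    $E(n_{ij \ldots l}) = \sum_\alpha E(a'_{\alpha(i,j,\ldots,l)})$

and

    $E(a'_{\alpha(i,j,\ldots,l)}) = \frac{\binom{N_i - 1}{n_i - 1} \binom{N_j - 1}{n_j - 1} \cdots \binom{N_l - 1}{n_l - 1}}{\binom{N_i}{n_i} \binom{N_j}{n_j} \cdots \binom{N_l}{n_l}} \, a_{\alpha(i,j,\ldots,l)} = \frac{n_i n_j \cdots n_l}{N_i N_j \cdots N_l} \, a_{\alpha(i,j,\ldots,l)}$.

Thus,

    $E\left( \frac{N_i N_j \cdots N_l}{n_i n_j \cdots n_l} \, n_{ij \ldots l} \right) = \sigma_{ij \ldots l}$.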

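Theorem 2: The variance of $\hat\sigma_{ij}$, written as Var($\hat\sigma_{ij}$), is equal to

    $\sigma_{ij} \frac{N_i N_j}{n_i n_j} \left[ 1 - \frac{(n_i - 1)(n_j - 1)}{(N_i - 1)(N_j - 1)} \right] + \sigma_{ij}^2 \left[ \frac{N_i N_j (n_i - 1)(n_j - 1)}{n_i n_j (N_i - 1)(N_j - 1)} - 1 \right]$    (4)

and in general

    $\mathrm{Var}(\hat\sigma_{ij \ldots l}) = \sigma_{ij \ldots l} \frac{N_i \cdots N_l}{n_i \cdots n_l} \left[ 1 - \frac{(n_i - 1) \cdots (n_l - 1)}{(N_i - 1) \cdots (N_l - 1)} \right] + \sigma_{ij \ldots l}^2 \left[ \frac{N_i \cdots N_l \, (n_i - 1) \cdots (n_l - 1)}{n_i \cdots n_l \, (N_i - 1) \cdots (N_l - 1)} - 1 \right]$    (5)

Proof:

    $\mathrm{Var}(\hat\sigma_{ij}) = E(\hat\sigma_{ij}^2) - \sigma_{ij}^2$,    $E(\hat\sigma_{ij}^2) = \left( \frac{N_i N_j}{n_i n_j} \right)^2 \frac{\sum n_{ij}^2}{\binom{N_i}{n_i} \binom{N_j}{n_j}}$,

where the $\sum$ is over all pairs of random samples of sizes $n_i$ and
$n_j$. Now

    $n_{ij}^2 = n_{ij} + 2 \binom{n_{ij}}{2}$,

since we have to consider the occurrence of single units such as $\alpha$ and the
pairs such as $\alpha$, $\beta$ in the samples in order to evaluate $n_{ij}^2$. Since
$(\alpha, \beta)$ and $(\beta, \alpha)$ are counted as one, there are $\sigma_{ij}(\sigma_{ij} - 1)/2$ pairs of
units common to $U_i$ and $U_j$, and $\sigma_{ij}$ single units as before. Further, the number of
samples in which $\alpha$ alone occurs is $\binom{N_i - 1}{n_i - 1} \binom{N_j - 1}{n_j - 1}$, and the number of samples in which
$\alpha$ and $\beta$ both occur is $\binom{N_i - 2}{n_i - 2} \binom{N_j - 2}{n_j - 2}$.

Thus

    $E(n_{ij}^2) = \frac{n_i n_j}{N_i N_j} \, \sigma_{ij} + \frac{n_i (n_i - 1) \, n_j (n_j - 1)}{N_i (N_i - 1) \, N_j (N_j - 1)} \, \sigma_{ij} (\sigma_{ij} - 1)$,

the first term following from Theorem 1.

Thus

    $\mathrm{Var}(\hat\sigma_{ij}) = \frac{N_i N_j}{n_i n_j} \, \sigma_{ij} + \frac{N_i N_j (n_i - 1)(n_j - 1)}{n_i n_j (N_i - 1)(N_j - 1)} \, \sigma_{ij} (\sigma_{ij} - 1) - \sigma_{ij}^2$

    $\phantom{\mathrm{Var}(\hat\sigma_{ij})} = \sigma_{ij} \frac{N_i N_j}{n_i n_j} \left[ 1 - \frac{(n_i - 1)(n_j - 1)}{(N_i - 1)(N_j - 1)} \right] + \sigma_{ij}^2 \left[ \frac{N_i N_j (n_i - 1)(n_j - 1)}{n_i n_j (N_i - 1)(N_j - 1)} - 1 \right]$.

The extension of the above technique to the evaluation of Var($\hat\sigma_{ij \ldots l}$) is
obvious. This completes the proof.
Corollary 1

If $n_i / N_i$ and $n_j / N_j$ are small, then

    $\mathrm{Var}(\hat\sigma_{ij}) \cong \frac{N_i N_j}{n_i n_j} \, \sigma_{ij}$.    (6)

Corollary 2

When $n_i = N_i$ and $n_j = N_j$, Var($\hat\sigma_{ij}$) = 0, as is to be expected.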
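
The variance expression can be compared with a complete enumeration in a toy case. The sketch below is a hedged illustration: `variance_formula` and `variance_by_enumeration` are hypothetical names, the formula coded is the two-population expression in the form $\sigma A (1 - B) + \sigma^2 (AB - 1)$ with $A = N_i N_j / (n_i n_j)$ and $B = (n_i - 1)(n_j - 1) / ((N_i - 1)(N_j - 1))$, and both populations are assumed to contain at least two units.

```python
from fractions import Fraction
from itertools import combinations


def variance_formula(sigma, N_i, N_j, n_i, n_j):
    """Variance of the estimator for two populations (requires N_i, N_j >= 2)."""
    A = Fraction(N_i * N_j, n_i * n_j)
    B = Fraction((n_i - 1) * (n_j - 1), (N_i - 1) * (N_j - 1))
    return sigma * A * (1 - B) + sigma ** 2 * (A * B - 1)


def variance_by_enumeration(U_i, U_j, n_i, n_j):
    """Exact variance of (N_i N_j / (n_i n_j)) n_ij over all sample pairs."""
    scale = Fraction(len(U_i) * len(U_j), n_i * n_j)
    values = [
        scale * len(set(S_i).intersection(S_j))
        for S_i in combinations(U_i, n_i)
        for S_j in combinations(U_j, n_j)
    ]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical populations with three units in common.
U_1, U_2 = list(range(0, 6)), list(range(3, 10))
exact = variance_by_enumeration(U_1, U_2, 2, 3)
from_formula = variance_formula(3, 6, 7, 2, 3)
```

With complete sampling (n_i = N_i, n_j = N_j) the formula returns zero, in agreement with Corollary 2.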

Theorem 3: If for all the subsets of the set of r populations, $\sigma$'s are
defined and if M is a real valued function of the $\sigma$'s, then there is at most one
sample function m such that E(m) = M. (This theorem is due to Goodman.)

Proof: Excluding the empty set and the sets of single populations, there
are $2^r - r - 1 = t$ subsets of the set of r populations. Let these t
subsets be ordered and ranked so that they receive the ranks 1, 2, ..., t.
Let $s_j$ and $\sigma_j$ be the estimator and its true value for the subset which
receives the rank j in the ordering. The estimator need not be the particular
one we have defined in previous theorems. The sample space consists of the
subsets $\{[s_1, s_2, \ldots, s_t]\}$ of the t-dimensional Euclidean space. We
order this subset by increasing values of $s_t$. For equal values of $s_t$ we
order the vectors by $s_{t-1}$ and so on, so that for equal values of $s_2$, we order
the vectors by $s_1$. We may thus describe the sample space as a sequence of
vectors

    $B_1 = [s_1(1), \ldots, s_t(1)]$,  $B_2 = [s_1(2), \ldots, s_t(2)]$,  etc.,

where $B_1$ is the smallest ordered vector, $B_2$ the next smallest, etc. To each sample
vector $B_j$ there is a population vector $P_j$ given by

    $P_j = [\sigma_1(j), \ldots, \sigma_t(j)]$,

where, as defined before, $\sigma_i(j)$ corresponds to $s_i(j)$ for i = 1, 2, ..., t. Let $\Pr(B_i / P_h)$
be the probability of obtaining sample vector $B_i$ when $P_h$ is the true population
vector. Evidently $\Pr(B_i / P_h) = 0$ if i > h, and $\Pr(B_i / P_i) > 0$ for
all i. Hence an unbiased estimator $m(B_i)$ of the population function $M(P_h)$
must be such that

    $\sum_{i=1}^{h} m(B_i) \Pr(B_i / P_h) = M(P_h)$    (7)

for h = 1, 2, 3, ... This necessary condition ensures the uniqueness
of $m(B_i)$, since $m(B_i)$ must satisfy the recursive formula (7). This
completes the proof.


Theorem 4: $\hat\sigma_{ij} = \frac{N_i N_j}{n_i n_j} \, n_{ij}$ is a minimum variance unbiased
estimator of $\sigma_{ij}$. (This is Goodman's argument.)

Proof: The proofs for Theorems 1, 2, and 3 indicate that if we want to
define an unbiased estimator, our best strategy is to use $\hat\sigma_{ij}$. This
means that $\hat\sigma_{ij}$ is indeed the minimum variance unbiased estimator.


1. Confidence Limits

In computing the confidence limits for estimates, we usually assume that
the estimator has a normal distribution. We shall make this assumption about
the probability distribution of $\hat\sigma_{ij}$ and shall later discuss in detail the
magnitude of approximation involved in such an assumption. If the assumption
holds, lower and upper confidence limits for $\sigma_{ij}$ are given by

    $\hat\sigma_{ij} - z \sqrt{\mathrm{Var}(\hat\sigma_{ij})} < \sigma_{ij} < \hat\sigma_{ij} + z \sqrt{\mathrm{Var}(\hat\sigma_{ij})}$    (8)

where z is the value of the normal deviate corresponding to the desired
confidence probability. The most common values of z are:

    Confidence probability    .80     .90     .95     .99
    z                         1.28    1.64    1.96    2.58

If sample size is less than 30 and $n_i = n_j = n$, the probability points z
may be taken from Student's t-table with (n - 1) degrees of freedom.

Thus in our problem, if $n_i / N_i$ and $n_j / N_j$ are small,

    $\hat\sigma_{ij} - z \sqrt{\frac{N_i N_j}{n_i n_j} \, \sigma_{ij}} < \sigma_{ij} < \hat\sigma_{ij} + z \sqrt{\frac{N_i N_j}{n_i n_j} \, \sigma_{ij}}$.

Thus if $n_i / N_i \cong n_j / N_j = a$ and if $|\hat\sigma_{ij} - \sigma_{ij}|$ has to be less
than $\epsilon$ with some specified confidence probability, then

    $a \ge z \sqrt{\sigma_{ij}} / \epsilon$.    (9)

Formula (9) gives the appropriate value of a, the sampling fraction.
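
A small numerical sketch of these limits follows. It is an illustration under stated assumptions rather than a prescribed procedure: the names `Z_VALUES`, `confidence_limits` and `required_fraction` are hypothetical; the small-sampling-fraction variance $(N_i N_j / n_i n_j)\,\sigma_{ij}$ is used; the unknown $\sigma_{ij}$ in that variance is replaced by its estimate (a plug-in assumption not made in the text); and the sampling fraction is obtained by solving $z \sqrt{\sigma_{ij}} / a \le \epsilon$ for a, given a prior guess of $\sigma_{ij}$.

```python
import math

# Normal deviates for the common confidence probabilities tabulated above.
Z_VALUES = {0.80: 1.28, 0.90: 1.64, 0.95: 1.96, 0.99: 2.58}


def confidence_limits(n_ij, N_i, N_j, n_i, n_j, confidence=0.95):
    """Approximate limits for sigma_ij under the normal assumption and
    small sampling fractions; sigma_ij in the variance is replaced by
    its estimate (plug-in assumption)."""
    scale = N_i * N_j / (n_i * n_j)
    estimate = scale * n_ij
    half_width = Z_VALUES[confidence] * math.sqrt(scale * estimate)
    return estimate - half_width, estimate + half_width


def required_fraction(sigma_guess, epsilon, confidence=0.95):
    """Smallest common sampling fraction a with z sqrt(sigma) / a <= epsilon,
    capped at 1 (complete enumeration)."""
    return min(1.0, Z_VALUES[confidence] * math.sqrt(sigma_guess) / epsilon)


low, high = confidence_limits(4, 1000, 1000, 100, 100)
a = required_fraction(400, 100)
```

The wide interval in this example (4 common units observed with 10 per cent samples) illustrates how quickly the precision degrades when the sampling fractions are small.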


2. Validity of the Normal Approximation

"In the usual sampling problems sample means are the estimators and
the finiteness of the second moment guarantees, by the Central Limit Theorem,
that the sample means are normally distributed if the populations are infinite.
Madow (1948) proved that for a large class of finite populations the distribution
of the sample mean tends to normality even if the sampling fraction is not
negligible and sampling is without replacement. From the study of theoretical
distributions that are skewed and from the results of sampling experiments on
actual skewed populations some statements can be made about what usually
happens to confidence probabilities when we sample from positively skew
populations. The sample size is assumed large enough so that the distribution of the
estimator shows some approach to normality. The statements are as follows:

(1) The frequency with which the assertion estimate - (1.96) s.d.
(estimator) < (true value) < estimate + (1.96) s.d. (estimator) is wrong
is usually slightly higher than 5%.

(2) The frequency with which (true value) > estimate + 1.96 s.d.
(estimator) is greater than 2.5%.
(3) The frequency with which (true value) < estimate - 1.96
s.d. (estimator) is less than 2.5%.

Here s.d. stands for standard deviation." [Cochran]

If the population is negatively skewed, statement (1) will still be true,
but in (2) the frequency will be less than 2.5% and in (3) the frequency will
be greater than 2.5%. Since any standard book on sampling theory gives an
excellent discussion of skewness and its effect on confidence statements, we
shall not elaborate on this point any further.

We shall show that the distribution of $\hat\sigma_{ij}$ could be both positively and
negatively skewed, even though most frequently in practical situations $\hat\sigma_{ij}$ is
positively skewed. It is worthwhile to emphasize the fact that if we are only
interested in the absolute value of the error of an estimate, then a fair amount
of skewness in the distribution of the estimator can be tolerated. However, if
we want to make a confidence statement, the normal approximation is not
trustworthy unless very little skewness remains in the distribution of the estimator.


There is no general rule as to how large the sample size must be for the
use of normal approximation. For populations in which the deviation from
normality consists mainly of marked positive skewness, a practical rule due to W. G.
Cochran is as follows:

    Sample size $> 25 \, G_1^2$    (10)

where

    $G_1 = \frac{\text{(Third Central Moment)}}{\text{(Second Central Moment)}^{3/2}}$.

This rule is designed so that a 95% confidence probability statement will
be wrong not more than 6% of the time.
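
As a rough illustration of rule (10), the coefficient $G_1$ and the implied minimum sample size can be computed from a set of observed values. The function name `cochran_minimum_sample_size` is hypothetical, population (divide-by-count) central moments are assumed, and the values are assumed not to be all equal, since the second central moment must be nonzero.

```python
def cochran_minimum_sample_size(values):
    """Return (G1, 25 * G1**2), where G1 is the third central moment
    divided by the 3/2 power of the second central moment."""
    count = len(values)
    mean = sum(values) / count
    second = sum((v - mean) ** 2 for v in values) / count
    third = sum((v - mean) ** 3 for v in values) / count
    g1 = third / second ** 1.5
    return g1, 25 * g1 ** 2


# A markedly skewed 0/1 population: one unit in four has the attribute.
g1, minimum_size = cochran_minimum_sample_size([0, 0, 0, 1])
```

For this example $G_1 \cong 1.15$, so the rule asks for a sample size greater than about 33.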

3. Skewness in the distribution of $\hat\sigma_{ij}$


Theorem 5: Sampling fractions being small, the probability distribution
of $\hat\sigma_{ij}$ is positively skewed or negatively skewed according to whether


positive  or  negative  respectively. 


Proof: 


(12)
where  L  ,  and  /  >  /  •  •«  are  defined  as  in  the  proof  of  Theorem

2-  L 


<,jS,  t  (i.j) 


is  defined  as  follows: 


1  if  units  ,  /  and  r  appear  in  both  and 
0  otherwise. 


Y  l  M  .  .  . .  has  to  be  taken  twice  since  the  order  in  which  and 

•«./(,. j i 

jS  are  taken  gives  rise  to  different  terms.  Equation  (12)  on  simplification 
gives 

(  .  .  j\  J  ....  2 

-0 


r  /  *  >\  If A 1  ,  / */,)*(*,  -  o  (*j  - 

£(^j)  •  + 3  [»!*,/ rf-') (4 ~ 


0  *7j<r(J-t) 


+  [*iy)  -  t)(Jf  -  i)(jij  -0(4’*) 


J 


We  have  already  evaluated  E(^.  *  )  and  E(  < )  .  Thus 

J  J 

^  lrr,] 


E(*j-  *j) 


tjj\  (*i  ~f)^i ~^X/ty  -tXrt/'-l)  f  t\f*- 

Wy) *5 (*r,itrr*\ 

s'  *9  *  wq-i)rtj(fifj-i) 


when 


>h_ 

Al¬ 


ar  e  small.  Equation  (14)  gives 


and 


Nj 


ELty-Vj)3  =  H  -  J 


(15) 


From equation (15), when $E(\hat{\sigma}_{ij} - \sigma_{ij})^3 > 0$ the distribution of
$\hat{\sigma}_{ij}$ has positive skewness, and if $E(\hat{\sigma}_{ij} - \sigma_{ij})^3 < 0$ then the
distribution of $\hat{\sigma}_{ij}$ has negative skewness. This is equivalent to the
assertion in the Theorem.

Note: When $\frac{n_i}{N_i}$ and $\frac{n_j}{N_j}$ are not small, equation (14) must be used to
determine skewness. In the extreme case when $n_i = N_i$ and $n_j = N_j$, it is
easy to verify that Eq. (14) reduces to zero as is to be expected.


Corollary 1: When $\frac{n_i}{N_i}$ and $\frac{n_j}{N_j}$ are small the coefficient of skewness,
$G_1$, of the distribution of $\hat{\sigma}_{ij}$ is

$$G_1 = \left[\frac{N_iN_j}{n_in_j\,\sigma_{ij}}\right]^{1/2} \tag{16}$$


Corollary  2:  When  77*  ,  i  •  1,  2,  ,r  .  are  small,  the  coefficient  of  skewness 

lYt 

A 

of  the  distribution  of  ^  is  given  by  Gj 

4. Effect of Non-Normality on Estimated Variance


We have derived earlier the variance of the estimator $\hat{\sigma}_{ij}$. The
expression for the variance is in terms of $\sigma_{ij}$, the population value. In
practice, when the variance has to be computed using sample values, it is
important to investigate the effect of non-normality of the distribution of
$\hat{\sigma}_{ij}$ on the computed variance, say $s^2$. One such effect is that the
estimated variance $s^2$ may be more highly variable from sample (pair)
to sample (pair) than we expect, if we assume that we are sampling from a
normal distribution. For any infinite population the variance of the estimated
variance $s^2$, in $m$ repeated samples (pairs), is

$$\text{Variance (estimated variance)} = \frac{2\mu_2^2}{m-1} + \frac{\kappa_4}{m} \tag{17}$$

Here $\mu_2$ and $\mu_4$ denote the second and fourth central moments of the parent
distribution.

The first term on the right hand side of Eq. (17) is the value which the
variance of $s^2$ has when the parent distribution is normal. The second
term represents the effect of non-normality. The quantity $\kappa_4$ is Fisher's
fourth cumulant (Fisher, 1932) and is given by

$$\kappa_4 = \mu_4 - 3\mu_2^2 \tag{18}$$

The skewness in the distribution as measured by $G_1$ does not affect the
stability of $s^2$. The important factor is the fourth moment in the population.
The cumulant $\kappa_4$ is zero for a normal distribution. It may take positive or
negative values in other distributions, but in those encountered in sampling
practice $\kappa_4$ appears to be positive much more often than negative and may
have a high value for some parent distributions.

$$\text{Variance}(s^2) = \frac{2\mu_2^2}{m-1}\left[1 + \frac{m-1}{2m}\,G_2\right] \tag{19}$$

where $G_2 = \kappa_4/\mu_2^2$ is Fisher's measure of kurtosis and
$\left[1 + \frac{m-1}{2m}\,G_2\right]$ is the factor by which the variance of $s^2$ is inflated due to
non-normality. Since this factor is almost independent of $m$, the
inflation remains even with large $m$.
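The remark that the inflation persists for large $m$ can be checked numerically. The sketch below assumes the inflation factor has the form $1 + \frac{m-1}{2m}G_2$ with $G_2 = \mu_4/\mu_2^2 - 3$ (Fisher's measure of kurtosis); the function names are hypothetical and the input values are illustrative only.

```python
def fisher_g2(values):
    """Kurtosis G2 = mu_4 / mu_2^2 - 3 computed from a finite list of values."""
    n = len(values)
    mean = sum(values) / n
    mu2 = sum((v - mean) ** 2 for v in values) / n
    mu4 = sum((v - mean) ** 4 for v in values) / n
    return mu4 / mu2 ** 2 - 3


def variance_inflation_factor(g2, m):
    """Factor by which the variance of s^2 exceeds its normal-theory value."""
    return 1 + (m - 1) / (2 * m) * g2


# With G2 = 3 the factor approaches 1 + 3/2 as m grows; it does not tend to 1.
for m in (10, 100, 10000):
    print(m, variance_inflation_factor(3.0, m))
```

The factor depends on $m$ only through $(m-1)/m$, so a larger number of sample pairs does not remove the effect of a heavy-tailed parent distribution.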


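We shall determine $G_2$ for $\hat{\sigma}_{ij}$. For this we have to determine

$$E(\hat{\sigma}_{ij} - \sigma_{ij})^4 = E(\hat{\sigma}_{ij}^4) - 4\sigma_{ij}E(\hat{\sigma}_{ij}^3) + 6\sigma_{ij}^2E(\hat{\sigma}_{ij}^2) - 4\sigma_{ij}^3E(\hat{\sigma}_{ij}) + \sigma_{ij}^4$$

$E(\hat{\sigma}_{ij}^4)$ can be computed using the $\delta$-functions in the same way as
the first, second and third moments. We have to define an additional
$\delta$-function as follows:

$$\delta_{\alpha\beta\gamma\delta}(i,j) = \begin{cases} 1 & \text{if elements } \alpha,\ \beta,\ \gamma \text{ and } \delta \text{ appear in both } S_i \text{ and } S_j, \\ 0 & \text{otherwise.} \end{cases}$$

Then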


$\binom{N_i-1}{n_i-1}\binom{N_j-1}{n_j-1}$ = number of pairs of samples in which a particular
unit, say $\alpha$, appears.

$\binom{N_i-2}{n_i-2}\binom{N_j-2}{n_j-2}$ = number of pairs of samples in which two particular
units, say $\alpha$ and $\beta$, appear.

$\binom{N_i-3}{n_i-3}\binom{N_j-3}{n_j-3}$ = number of pairs of samples in which three particular
units, say $\alpha$, $\beta$, and $\gamma$, appear.

$\binom{N_i-4}{n_i-4}\binom{N_j-4}{n_j-4}$ = number of pairs of samples in which four particular
units, say $\alpha$, $\beta$, $\gamma$, and $\delta$, appear.

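The coefficients 14, 36 and 24 are obtained as follows: Since we are taking
$n_{ij}^4$, we have to obtain the multinomial coefficients for 2 units, 3 units,
and 4 units. For 2 units, we take $\frac{4!}{3!\,1!}$ for terms like $\delta_{\alpha}^3\delta_{\beta}$,
$\frac{4!}{2!\,2!}$ for terms like $\delta_{\alpha}^2\delta_{\beta}^2$ and $\frac{4!}{1!\,3!}$ for terms like
$\delta_{\alpha}\delta_{\beta}^3$, giving a total of 14. For 3 units, we have to take
$3\cdot\frac{4!}{2!\,1!\,1!} = 36$ for terms like $\delta_{\alpha}^2\delta_{\beta}\delta_{\gamma}$.
For 4 units we have $\frac{4!}{1!\,1!\,1!\,1!} = 24$.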


Thus

$$E(\hat{\sigma}_{ij}^4) = \left(\frac{N_iN_j}{n_in_j}\right)^4 \left[ \frac{n_in_j}{N_iN_j}\,\sigma_{ij} + 14\,\frac{n_i(n_i-1)\,n_j(n_j-1)}{N_i(N_i-1)\,N_j(N_j-1)}\binom{\sigma_{ij}}{2} + 36\,\frac{n_i(n_i-1)(n_i-2)\,n_j(n_j-1)(n_j-2)}{N_i(N_i-1)(N_i-2)\,N_j(N_j-1)(N_j-2)}\binom{\sigma_{ij}}{3} + 24\,\frac{n_i(n_i-1)(n_i-2)(n_i-3)\,n_j(n_j-1)(n_j-2)(n_j-3)}{N_i(N_i-1)(N_i-2)(N_i-3)\,N_j(N_j-1)(N_j-2)(N_j-3)}\binom{\sigma_{ij}}{4} \right]$$

It is easy to verify that when $n_i = N_i$ and $n_j = N_j$, $E(\hat{\sigma}_{ij}^4) = \sigma_{ij}^4$.

Now using the values of $E(\hat{\sigma}_{ij})$, $E(\hat{\sigma}_{ij}^2)$ and $E(\hat{\sigma}_{ij}^3)$ computed
earlier, we have

E(rr  V  •  kp  ry  *  7  M  (ZRopT 
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(*  (*ii-/Xnt-z)(ni-  s)(  nj-tXrij  - 1)(  nf  -  J )  __  y  4/  w  4 

*  tvf/  U-'M/-  *XA ';-3Xty-,)(*fri)(.Kr3)  *g(*}-*x*#’*X*jr-*' 

-4r^)  ^  -  «  r^7  i*-)<4:>r  ^ 

/ ^  a;-  ]z  ^ «/  -  /  V 4/ -  *X/y- /)(*,  -  i)  , 

( *,  nj)  (*;-')(/% -tX/Zj-i)  <%/  ' 

..  / ^  K\  —  <»  ,  (*/-/)( ft;  -t) 

^y/  %  * 6  T/ii-iiidj-i) 


*h  -/)  - 


*  # 

Notice that when $n_i = N_i$ and $n_j = N_j$, $E(\hat{\sigma}_{ij} - \sigma_{ij})^4 = 0$ as is to
be expected.

The extension of the above proof to $E(\hat{\sigma}_{ij \ldots l} - \sigma_{ij \ldots l})^4$ is obvious.

A  jf 

Theorem  6:  When  and  are  small  £(«$-  Vs 

In  general  »h«n  -2- . are  amall  £($j-  /y,_ 

The  proof  of  this  theorem  follows  from  (20). 

f€  '  H* 

Theorem  7:  When  — and — are  small,  the  coefficient  of  kurtosis 
~  ^ 

G,  of  the  distribution  of  fTy  m  (*'4)-L  -  J  and  in  general  the  coefficient  of 

2  J  \**71  v# 

a  J  */■  /J-  rft  /  » 

kurtosis  G,  of  the  distribution  of  <57  *  — J. LlL  *  -y  j  3  • 

2  y-*  «,'*/•••  y 


Proof: 


Mr  vt 


by  definition. 


Thus 


°y 


(±±)\ 

*  v«‘ «/- 


-  3  r 


G,  -  -M%^L  1 


£<%-i 


r.. 

— V-1 


<r.. 

IT 


±%  jL 
*V  *9 


-  3 


w 


-  3 


s  'L  i. 


», .■  ”j 


-  3 


** 

5. Observations about the skewness and kurtosis of the distribution of $\hat{\sigma}_{ij}$


When sampling fractions are small the coefficient of skewness $G_1$
of the distribution of $\hat{\sigma}_{ij}$ is approximately $\left[\frac{N_iN_j}{n_in_j\,\sigma_{ij}}\right]^{1/2}$. This indicates
that for small values of $\sigma_{ij}$ the skewness is very pronounced unless sampling
fractions are made large, in which case the exact formula for
$E(\hat{\sigma}_{ij} - \sigma_{ij})^3$, Eq. (14), has to be used to compute the skewness. Except for
documents that have nothing in common, $\left[\frac{N_iN_j}{n_in_j}\right]^{1/2}$ is an upper bound for
skewness when sampling fractions are small. The coefficient of excess or
kurtosis $G_2$ for small sampling fractions varies as $\frac{N_iN_j}{n_in_j\,\sigma_{ij}}$. Here again
when $\sigma_{ij}$ is very small $G_2$ is large, and for all pairs of documents which have
at least one unit in common, $\frac{N_iN_j}{n_in_j}$ is an upper bound for $G_2$.
The probability distribution of $n_{ij}$


In order to show by computation that the unbiased estimator
$\frac{N_iN_j}{n_in_j}\,n_{ij}$ is indeed the maximum likelihood estimator, we have to
evaluate the probability for $n_{ij}$. In order that the two samples of sizes
$n_i$ and $n_j$ should have $n_{ij} = a$ co-occurrences, these items should be
included in the samples. Let $N$ be the number of items that co-occur
in the populations $U_i$ and $U_j$; then any $\binom{N}{a}$ could be included in the two
samples, and the remaining $n_i - a$ and $n_j - a$ items are drawn from the two
populations, respectively, so that none of the common items appears among
both the $(n_i - a)$ and $(n_j - a)$ items of the two samples.

Thus, the probability that $n_{ij} = a$ can be easily seen to be

$$P(n_{ij} = a) = \sum_{y=0}^{N-a} \frac{\binom{N}{a}\binom{N-a}{y}\binom{N_i-N}{n_i-a-y}\binom{N_j-a-y}{n_j-a}}{\binom{N_i}{n_i}\binom{N_j}{n_j}}$$

where $y$ counts the further common items that fall in the sample from $U_i$
but not in the sample from $U_j$.

An alternative method of derivation of $P(n_{ij} = a)$ will be briefly
discussed below. The derivation is based on a theorem concerning the
realization of $m$ among $N$ events. For a more detailed reading, Chapter IV
of W. Feller's "An Introduction to Probability Theory and its
Applications" forms an excellent reference. If $A_1, A_2, \ldots, A_N$ are
any $N$ events, not necessarily mutually exclusive, then $P_{[a]}$, the
probability that exactly $a$ among the $N$ events occur, can be expressed
in terms of $S_a, S_{a+1}, \ldots, S_N$, where $S_a = \sum P(A_{i_1} A_{i_2} \cdots A_{i_a})$, the sum
extending over all sets of $a$ events, and $P(A_1, \ldots, A_a)$ is the probability
of joint occurrence of $A_1, \ldots, A_a$.

In essence, the theorem asserts that

$$P_{[a]} = S_a - \binom{a+1}{a}S_{a+1} + \binom{a+2}{a}S_{a+2} - \cdots \pm \binom{N}{a}S_N$$

In the context of our problem, $S_a$ is the probability that
any $a$ of the common items will appear in both the samples. This is
obviously

$$S_a = \frac{\binom{N}{a}\binom{N_i-a}{n_i-a}\binom{N_j-a}{n_j-a}}{\binom{N_i}{n_i}\binom{N_j}{n_j}}$$

Thus:

$$P_{[a]} = \sum_{r=a}^{N} (-1)^{r-a}\binom{r}{a}\,\frac{\binom{N}{r}\binom{N_i-r}{n_i-r}\binom{N_j-r}{n_j-r}}{\binom{N_i}{n_i}\binom{N_j}{n_j}}$$
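The alternating series obtained from the theorem on the realization of exactly $a$ among $N$ events can be evaluated directly. The sketch below assumes $S_r = \binom{N}{r}\binom{N_i-r}{n_i-r}\binom{N_j-r}{n_j-r} / \left[\binom{N_i}{n_i}\binom{N_j}{n_j}\right]$ and $P_{[a]} = \sum_{r \ge a} (-1)^{r-a}\binom{r}{a} S_r$; function names are hypothetical. Exact fractions are used because the terms alternate in sign and floating-point cancellation could otherwise be severe.

```python
from fractions import Fraction
from math import comb


def joint_sum(r, common, N_i, N_j, n_i, n_j):
    """S_r: sum over all sets of r common items of P(all r are in both samples)."""
    if r > min(common, n_i, n_j):
        return Fraction(0)
    return Fraction(comb(common, r) * comb(N_i - r, n_i - r) * comb(N_j - r, n_j - r),
                    comb(N_i, n_i) * comb(N_j, n_j))


def pmf_exactly(a, common, N_i, N_j, n_i, n_j):
    """P(exactly a co-occurrences) by the inclusion-exclusion series."""
    return sum((-1) ** (r - a) * comb(r, a) * joint_sum(r, common, N_i, N_j, n_i, n_j)
               for r in range(a, common + 1))


for a in range(5):
    print(a, float(pmf_exactly(a, 4, 100, 100, 50, 50)))
```

The values agree with those obtained from the direct counting argument, which gives a useful cross-check on either formula.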


where $\alpha_1, \alpha_2, \ldots, \alpha_r$ are items in the populations,

and where $D(r;\, a_1^{(s)}, a_2^{(s)}, \ldots, a_s^{(s)})$ is the denumerant introduced by Sylvester
for partitions.

Thus, $D(r;\, a_1^{(s)}, a_2^{(s)}, \ldots, a_s^{(s)})$ denotes the number of partitions of $r$
into parts $a_1^{(s)}, a_2^{(s)}, \ldots, a_s^{(s)}$, which is the same as the number of
solutions in non-negative integers of $a_1^{(s)}x_1 + a_2^{(s)}x_2 + \ldots + a_s^{(s)}x_s = r$.

The generating function for this is

$$D(t;\, a_1^{(s)}, \ldots, a_s^{(s)}) = \sum_{r} D(r;\, a_1^{(s)}, \ldots, a_s^{(s)})\, t^r = \frac{1}{\left(1 - t^{a_1^{(s)}}\right)\left(1 - t^{a_2^{(s)}}\right)\cdots\left(1 - t^{a_s^{(s)}}\right)}$$
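The denumerant can be computed by expanding the generating function one factor at a time. The sketch below is a standard dynamic-programming expansion, not a procedure taken from the report; it assumes the parts are distinct positive integers, and the function name is hypothetical.

```python
def denumerant(r, parts):
    """Number of non-negative integer solutions of sum(parts[k] * x[k]) == r.

    Multiplies out 1 / ((1 - t^a_1) ... (1 - t^a_s)) up to the power t^r,
    one factor at a time; counts[total] holds the coefficient of t^total.
    """
    counts = [1] + [0] * r
    for part in parts:
        for total in range(part, r + 1):
            counts[total] += counts[total - part]
    return counts[r]


# Partitions of 4 into parts 1, 2, 3: 1+1+1+1, 2+1+1, 2+2, 3+1.
print(denumerant(4, [1, 2, 3]))
```

Processing one part at a time is what keeps each partition from being counted once per ordering of its parts.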
Computational verification of the fact that $\frac{N_iN_j}{n_in_j}\,n_{ij}$
is the maximum likelihood estimate of $\sigma_{ij}$.


We are interested in finding that value of $\sigma_{ij}$ which maximizes
the probability that the random variable $x$ has an observed value
$n_{ij}$. In one sample problem such as the one discussed by Feller (4),
the calculations on the basis of sampling with replacement in two
stages reduce the determination of the maximum likely number of
items to some fairly simple inequalities in terms of certain observed
values. But in the two sample cases, any similar calculations are
very difficult and as such we shall use a computational scheme.

Here we shall only include the tabulation of $P(x)$ for $N_i = N_j = 100$,
$n_i = n_j = 50$, $\sigma_{ij} = 4\,(4)\,48$ and for $N_i = N_j = 200$, $n_i = n_j = 100$,
$\sigma_{ij} = 4\,(4)\,40$. These tabulations are sufficient to indicate that
$\frac{N_iN_j}{n_in_j}\,n_{ij}$ is the maximum likelihood estimate of $\sigma_{ij}$. When
$\frac{N_iN_j}{n_in_j}\,n_{ij}$ is not an integer, $\left[\frac{N_iN_j}{n_in_j}\,n_{ij}\right]$ should be used
as the maximum likelihood estimate of $\sigma_{ij}$. The tabulation is
self-explanatory.
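A computational check of this kind can be sketched as a direct search: for an observed number of co-occurrences, evaluate the likelihood for every admissible number of common items and keep the maximizer. The sketch below assumes the counting formula for the distribution of co-occurrences (a sum over the common items that enter the first sample only); function names are hypothetical. Because the denominator $\binom{N_i}{n_i}\binom{N_j}{n_j}$ does not depend on the number of common items, integer counts of sample pairs are compared instead of probabilities.

```python
from math import comb


def cooccurrence_ways(a, common, N_i, N_j, n_i, n_j):
    """Number of sample pairs with exactly `a` co-occurrences among `common` shared items."""
    if a < 0 or a > min(common, n_i, n_j):
        return 0
    ways = 0
    for y in range(common - a + 1):   # y: shared items in the first sample only
        if a + y > n_i:
            break
        ways += (comb(common, a) * comb(common - a, y)
                 * comb(N_i - common, n_i - a - y)
                 * comb(N_j - a - y, n_j - a))
    return ways


def most_likely_common_count(observed, N_i, N_j, n_i, n_j):
    """Value of the number of common items that maximizes P(n_ij = observed)."""
    candidates = range(observed, min(N_i, N_j) + 1)
    return max(candidates,
               key=lambda c: cooccurrence_ways(observed, c, N_i, N_j, n_i, n_j))


# One co-occurrence observed with sampling fractions of 1/2 in both populations.
print(most_likely_common_count(1, 100, 100, 50, 50))
```

With sampling fractions of one half the estimator $\frac{N_iN_j}{n_in_j}\,n_{ij}$ equals four times the observed count, so one observed co-occurrence corresponds to an estimate of 4.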
Tabulation of the probability distribution of co-occurrences when the
sampling fractions are each 1/2.

N1 = # of items in the 1st population
N2 = # of items in the 2nd population
N = # of items common to or co-occurring in both populations
N1/2, N2/2 are the sizes of samples
X = observed # of co-occurring items in samples
PX(N) = Probability of obtaining X co-occurrences when there are N
common items in the two populations.

For various values of N, the maximum likelihood value X and the
corresponding probability are underlined.


N1 

N2 

N 

X 

PX(N)

100 

100 

4 

0 

0.31214111 

ICO 

100 

4 

J_ 

Q.4275B369 

100 

100 

4 

2 

0.21104912 

100 

100 

4 

3 

0.04497169 

100 

ICO 

4 

4 

0.00344940 

ICO 

100 

8 

0 

0.09383162 

ICO 

100 

8 

i 

0.26683131 

100 

100 

8 

.QtttSQJQH 

100 

100 

8 

3 

0.21137340 

ICO 

100 

8 

4 

0.00401038 

100 

100 

8 

3 

0.02057208 

100 

100 

8 

6 

0.00302837 

100 

100 

8 

7 

0.00024489 

ICO 

100 

8 

8 

0.00000832 

100 

100 

12 

0 

0.02708941 

100 

100 

12 

1 

0.12015590 

100 

100 

12 

2 

0.23534560 

ICO 

100 

12 

3 

0.26903415 

100 

100 

12 

v~ 

0.19980867 

too 

100 

12 

5 

0.10151327 

100 

100 

12 

6 

0.03615530 

ICO 

100 

12 

7 

0.00909021 

100 

100 

12 

8 

0.00160017 

100 

100 

12 

9 

0.00019220 

100 

100 

12 

10 

0.00001494 

100 

100 

12 

11 

0.00000067 

100 

100 

12 

12 

0.00000001 

too 

100 

18 

0 

0.00748815 

100 

loo 

16 

1 

q. 04612894 

ICO 

100 

16 

2 

0.12825721 

100 

100 

16 

3 

0.21354941 

100 

100 

16 

4 

0. 23819625, 
N1   N2   N   X   PX(N)        N1   N2   N   X   PX(N)

100 

ICO 

16 

5 

0.  18862925 

ICO 

100 

28 

3 

0.02919267 

ICO 

100 

16 

6 

0.1C966165 

100 

100 

28 

6 

0.07120921 

100 

100 

16 

7 

0.06  768691 

100 

100 

28 

5 

0.12798672 

100 

ICO 

16 

8 

0.0156  7638 

100 

100 

28 

6 

0.17616900 

100 

too 

16 

9 

0.00390386 

100 

too 

28 

T. 

0.19068589 

ICO 

100 

16 

10 

0.00073375 

100 

100 

28 

8 

0.16677585 

100 

100 

16 

11 

0.0001029C 

100 

ICO 

28 

9 

0.11568816 

100 

ICO 

16 

12 

O.OCOQ 1054 

100 

too 

28 

10 

0.0661 902 1 

100 

100 

16 

13 

0.00000076 

ICO 

100 

28 

11 

0.03122678 

ICO 

100 

16 

16 

O.OOOCCCC3 

too 

100 

28 

12 

0.01217706 

100 

100 

16 

15 

0.00000000 

100 

lOO 

28 

13 

0.00393627 

100 

ICO 

16 

16 

o.ocoooooo 

100 

100 

28 

16 

0.00105583 

100 

ICO 

20 

0 

0.00197506 

ICO 

100 

28 

15 

0.00023692 

ICO 

100 

20 

1 

0.01587299 

100 

too 

28 

16 

0.00006328 

100 

100 

20 

2 

0.05830571 

100 

ICO 

28 

17 

0.00000658 

100 

ICO 

20 

3 

0.  130C9760 

too 

100 

28 

18 

0.00000082 

100 

ICO 

20 

6 

0.  19766336 

100 

100 

28 

19 

O.OOOCOCOB 

ICO 

100 

20 

5 

0.21725863 

100 

100 

28 

20 

O.OOOCOOOO 

100 

100 

20 

6 

0.17916570 

100 

ICO 

28 

21 

0.00000000 

100 

ICO 

20 

7 

0.11361063 

100 

100 

28 

22 

0.00000000 

100 

ICO 

20 

8 

0.05596671 

100 

100 

28 

23 

o.ooccccod 

ICO 

100 

20 

9 

0.0217C372 

100 

100 

28 

26 

o.ooocoooo 

100 

100 

20 

10 

0.00665252 

100 

100 

28 

25 

0.00000000 

too 

100 

20 

11 

0.00161269 

100 

100 

28 

26 

0.00000000 

100 

100 

20 

12 

0.00030839 

100 

ICO 

28 

27 

o.ooooccoo 

ICO 

ICO 

20 

13 

O.OOOC6627 

100 

100 

28 

28 

0.00000000 

100 

100 

20 

16 

0.00000537 

ICO 

100 

32 

0 

0.00002623 

100 

100 

20 

15 

0.00000067 

100 

100 

32 

1 

0.00038896 

100 

ICO 

20 

16 

0.00000003 

100 

100 

32 

2 

0.00268119 

100 

100 

20 

17 

O.OCOCCCOO 

100 

100 

32 

3 

0.01166083 

100 

100 

20 

18 

0.00000000 

100 

ICO 

32 

6 

0.03396666 

100 

ICO 

20 

19 

o.ooocoooo 

100 

100 

32 

5 

0.07657060 

100 

ICO 

20 

20 

0.00000000 

ICO 

100 

32 

6 

0.126123C9 

ICO 

100 

26 

0 

0.00069513 

100 

100 

32 

7 

0.16859676 

100 

100 

26 

1 

0.00699659 

100 

100 

32 

±- 

0.18165308 

100 

100 

26 

2 

0.02321275 

100 

ICO 

32 

9 

0.15937337 

100 

ICO 

26 

3 

0.06611821 

too 

100 

32 

10 

0.11538137 

ICO 

100 

26 

6 

0.1295175C 

100 

100 

32 

11 

0.06936508 

100 

100 

26 

5 

0.18559521 

100 

ICO 

32 

12 

0.C368 1675 

100 

ICO 

26 

fLm 

0.  2C2C6028 

too 

100 

32 

13 

0.01666686 

100 

ICO 

26 

7 

0.17128365 

ICO 

100 

32 

16 

0.00517721 

100 

100 

26 

8 

0.11501052 

too 

100 

32 

15 

0.00153968 

100 

100 

26 

9 

0.06187926 

100 

ICO 

32 

16 

0.00038537 

100 

ICO 

26 

10 

0.02688570 

100 

100 

32 

17 

0.00008111 

100 

ICO 

26 

11 

0.0096  7900 

100 

100 

32 

18 

0.00001633 

ICO 

100 

26 

12 

0.00271828 

100 

100 

32 

19 

0.00000212 

100 

100 

26 

13 

0.00063616 

100 

100 

32 

20 

0.00000026 

100 

100 

26 

16 

0.00012010 

100 

100 

32 

21 

0.00000002 

100 

ICO 

26 

15 

0.00001839 

100 

100 

32 

22 

O.OOOCOOOO 

ICO 

100 

26 

16 

O.OOOCC226 

100 

100 

32 

23 

0.00000000 

100 

100 

26 

17 

0.00000022 

100 

ICO 

32 

26 

0.00000000 

100 

ICO 

26 

18 

0.0C000C01 

too 

ICO 

32 

25 

0.00000000 

100 

ICO 

26 

19 

0.00000000 

100 

100 

32 

26 

0.00000000 

ICO 

100 

26 

20 

o.occcocoo 

100 

100 

32 

27 

0.00000000 

100 

100 

26 

21 

0.00000000 

100 

100 

32 

28 

0.00000000 

100 

ICO 

26 

22 

o.ocoooooo 

100 

too 

32 

29 

0.00000000 

100 

ICO 

26 

23 

0.00000000 

100 

100 

32 

30 

o.ooocoooo 

ICO 

100 

26 

26 

o.ooocoooo 

100 

100 

32 

31 

0.00000000 

100 

100 

28 

0 

0.00011766 

100 

ICO 

32 

32 

0.00000000 

100 

100 

28 

1 

0.00166937 

100 

100 

36 

0 

0.00000568 

100 

ICO 

28 

2 

0.00828632 

100 

100 

36 

1 

0.00009667 
N1   N2   N   X   PX(N)

100 

100 

36 

2 

0.00079096 

100 

100 

60 

26 

0.00000000 

100 

100 

36 

3 

0.00602561 

too 

100 

60 

27 

o.ooocoooo 

ICO 

100 

36 

6 

0.01629017 

100 

100 

60 

28 

0.00000000 

100 

100 

36 

5 

0.03768762 

100 

100 

60 

29 

0.00000000 

100 

100 

36 

6 

0.07680590 

too 

100 

60 

30 

0.00000000 

100 

100 

36 

7 

0.12621556 

ICO 

100 

60 

31 

0.00000000 

ICO 

100 

36 

8 

0.16266317 

100 

too 

60 

32 

0.00000000 

100 

100 

36 

9 

_._0. 17625  l  79 

100 

100 

60 

33 

0.00000000 

100 

100 

36 

nr* 

0.  156876T9 

100 

100 

60 

36 

0.00000000 

too 

100 

36 

u 

0. 11698225 

ICO 

too 

60 

35 

o.ooocoooo 

100 

100 

36 

12 

0.07173630 

100 

100 

60 

36 

0.00000000 

ICO 

100 

36 

13 

0.03777776 

100 

ICO 

60 

37 

0.00000000 

100 

100 

36 

16 

0.01686935 

100 

too 

60 

38 

0. OOOOOOOG 

100 

ICO 

36 

15 

0.00637883 

ICO 

100 

60 

39 

o.ooocoooo 

ICO 

100 

36 

16 

0.00205251 

100 

100 

60 

60 

0.00000000 

ICO 

100 

36 

17 

0.00056161 

100 

100 

66 

0 

0.00000019 

100 

100 

36 

18 

0.00013066 

100 

100 

66 

1 

0.00000662 

100 

100 

36 

19 

0.00002581 

100 

100 

66 

2 

0.00005218 

ICO 

100 

36 

20 

0.00000632 

100 

100 

66 

3 

0.00036699 

100 

100 

36 

21 

0.00000061 

100 

100 

66 

6 

0.00180650 

100 

100 

36 

22 

0.00000007 

100 

100 

66 

5 

0.00663222 

100 

100 

36 

23 

0.00000000 

100 

100 

66 

6 

0.01889621 

ICO 

100 

36 

26 

0.00000000 

100 

100 

66 

7 

0.06292551 

100 

100 

36 

25 

0.00000000 

100 

100 

66 

8 

0.07926786 

100 

100 

36 

26 

0.00000000 

100 

ICO 

66 

9 

0.12072065 

100 

100 

36 

27 

0.00000000 

100 

100 

66 

10 

0.15330181 

ICO 

100 

36 

28 

0.00000000 

100 

100 

66 

-11 

0.1637L922 

100 

100 

36 

29 

0.00000000 

100 

too 

66 

TF 

0.  168C2530 

100 

100 

36 

30 

0.00000000 

100 

ICO 

66 

13 

0.11390086 

100 

100 

36 

31 

0.00000000 

ICO 

100 

66 

16 

0.07689260 

ICO 

100 

36 

32 

0.00000000 

100 

100 

66 

15 

0.06221096 

ICO 

100 

36 

33 

0.00000000 

100 

ICO 

66 

16 

0.02066027 

100 

100 

36 

36 

0.00000000 

too 

100 

66 

17 

0.00851767 

100 

ICO 

36 

35 

0.00000000 

100 

100 

66 

18 

0.00305761 

ICO 

100 

36 

36 

0.00000000 

100 

100 

66 

19 

0.00096572 

ICO 

100 

60 

0 

0.00000107 

100 

100 

66 

20 

0.00028205 

100 

100 

60 

1 

0.00002205 

100 

100 

6* 

21 

0.00005785 

100 

ICO 

60 

2 

0.00021289 

ICO 

100 

66 

22 

O.OOOC 1161 

ICO 

100 

60 

3 

0.00127856 

100 

100 

66 

23 

0.00000193 

ICO 

100 
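Limiting Probability Distribution


Case 1. $\frac{n_i}{N_i} \to c_i$ and $\frac{n_j}{N_j} \to c_j$ as $n_i, n_j, N_i, N_j \to \infty$.


Let $N$ be the number of common items in the two populations and
let $n_{ij} = a$. The terms of the alternating series for $P(n_{ij} = a)$ are
simplified by means of the identity

$$\binom{N}{a+r}\binom{a+r}{a} = \frac{N!}{(a+r)!\,(N-a-r)!}\cdot\frac{(a+r)!}{a!\,r!} = \frac{N!}{a!\,(N-a)!}\cdot\frac{(N-a)!}{(N-a-r)!\,r!} = \binom{N}{a}\binom{N-a}{r}$$

so that in the limit the distribution of $n_{ij}$ is binomial:

$$P(n_{ij} = a) \to \binom{N}{a}(c_ic_j)^a(1 - c_ic_j)^{N-a}$$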


Case 2. Since case 1 leads to a binomial distribution, we can
obtain a Poisson distribution with the added conditions that
$Np = \lambda$ stays fixed as $N \to \infty$, where $p = \frac{n_in_j}{N_iN_j}$:

$$\binom{N}{a}p^a(1-p)^{N-a} \to \frac{e^{-\lambda}\lambda^a}{a!}$$

Note: The maximum likelihood property of $\frac{N_iN_j}{n_in_j}\,n_{ij}$ as
the estimate for $N$ is obvious in the two limiting distributions.
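The quality of the two limiting forms can be gauged by comparing them with the exact distribution for finite populations. The sketch below assumes the exact probability is given by the counting formula (a sum over the common items entering the first sample only), that the binomial limit has success probability $p = n_in_j/(N_iN_j)$, and that the Poisson limit has mean $\lambda = Np$; all function names are hypothetical.

```python
from math import comb, exp, factorial


def exact_pmf(a, common, N_i, N_j, n_i, n_j):
    """Exact P(n_ij = a) for two independent samples drawn without replacement."""
    if a < 0 or a > min(common, n_i, n_j):
        return 0.0
    ways = sum(comb(common, a) * comb(common - a, y)
               * comb(N_i - common, n_i - a - y) * comb(N_j - a - y, n_j - a)
               for y in range(min(common - a, n_i - a) + 1))
    return ways / (comb(N_i, n_i) * comb(N_j, n_j))


def binomial_pmf(a, common, p):
    """Binomial limit: each common item lands in both samples with probability p."""
    return comb(common, a) * p ** a * (1 - p) ** (common - a)


def poisson_pmf(a, lam):
    """Poisson limit with mean lam = common * p."""
    return exp(-lam) * lam ** a / factorial(a)


p = (50 * 50) / (100 * 100)
for a in range(5):
    print(a, exact_pmf(a, 4, 100, 100, 50, 50), binomial_pmf(a, 4, p), poisson_pmf(a, 4 * p))
```

With populations of 100 the binomial form is already close to the exact values, while the Poisson form is rougher here because $p = 1/4$ is not small.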
STRATIFIED SAMPLING


Assumptions:

(1) As before.

(2) As before.

(3) Each population is stratified into L non-overlapping strata so
that $N_i = N_i^{(1)} + N_i^{(2)} + \ldots + N_i^{(L)}$, and $N_i^{(h)}$ represents
the number of units in the $h$th stratum in $U_i$.

(4) The sample of size $n_i$ drawn from population $U_i$ comes from the L
strata so that $n_i = n_i^{(1)} + n_i^{(2)} + \ldots + n_i^{(L)}$, and $n_i^{(h)}$ stands
for the portion of the sample from the $h$th stratum in $U_i$.

A. Main reasons for stratification

"(a)  If  data  of  known  precision  are  wanted  for  certain  subdivisions  of 
the  population,  it  is  advisable  to  treat  each  subdivision  as  a  "population"  in  its 
own  right. 

(b)  Administrative  convenience  may  dictate  the  use  of  stratification 
e.g.  an  agency  conducting  a  survey  may  have  field  offices,  each  of  which  can 
supervise  the  survey  for  a  part  of  the  population. 

(c)  There  may  be  a  marked  difference  in  the  sampling  problems  in 
different  parts  of  the  population. 

(d) Stratification may bring about a gain in precision in the estimates of
characteristics of the whole population. The basic idea is that it may be
possible to divide a heterogeneous population into subpopulations, each of which
is internally homogeneous. If each stratum is homogeneous, in that units
vary very little from one another, a precise estimate of any stratum value
can be obtained from a small sample in that stratum." [Cochran]
B. Sampling scheme: In stratified sampling the population of $N_i$ units
$(i = 1, \ldots, r)$ is divided into subpopulations so
that $N_i = N_i^{(1)} + N_i^{(2)} + \ldots + N_i^{(L)}$. The subpopulations are called
strata. To obtain the full benefit from stratification $N_i^{(h)}$ must be known.
When the strata have been determined, a sample is drawn from each of the
strata. The sample sizes within the strata are denoted by
$n_i^{(1)}, n_i^{(2)}, \ldots, n_i^{(L)}$ respectively. If a simple random sample is taken in
each stratum, the whole procedure is described as stratified random sampling.
Notations: The following symbols all refer to stratum h:

$N_i^{(h)}$   Total number of units of $U_i$.

$n_i^{(h)}$   Total number of units in sample.

$n_{ij}^{(h)}$   Total number of units common to samples from $U_i$ and $U_j$.

$\sigma_{ij}^{(h)}$   Total number of units common to $U_i$ and $U_j$.

$\hat{\sigma}_{ij}^{(h)}$   Estimator of number of units common to $U_i$ and $U_j$.


$\sigma_{ij}$   Total number of units common to $U_i$ and $U_j$.


We shall denote the estimator of $\sigma_{ij}$ for the populations $U_i$ and $U_j$ by
$(\hat{\sigma}_{ij})_{st}$.
Problems


(1) To define estimators of $\sigma_{ij}$.

(2) To investigate properties of such estimators.

(3) To determine the best choice of $n_i^{(h)}$ and $n_j^{(h)}$ so as to
obtain maximum precision.

(4)  To  extend,  whenever  appropriate,  the  estimation  problems 
to  more  than  two  populations. 


Analysis 


Theorem 8: In any stratum h,

$$\hat{\sigma}_{ij}^{(h)} = \frac{N_i^{(h)}N_j^{(h)}}{n_i^{(h)}n_j^{(h)}}\,n_{ij}^{(h)}$$

is an unbiased estimator of $\sigma_{ij}^{(h)}$, and in general

$$\hat{\sigma}_{ij \ldots l}^{(h)} = \frac{N_i^{(h)}N_j^{(h)} \cdots N_l^{(h)}}{n_i^{(h)}n_j^{(h)} \cdots n_l^{(h)}}\,n_{ij \ldots l}^{(h)}$$

is an unbiased estimator of $\sigma_{ij \ldots l}^{(h)}$.


Proof: Since each stratum can be regarded as a subpopulation we need
consider $E(\hat{\sigma}_{ij}^{(h)})$ and $E(\hat{\sigma}_{ij \ldots l}^{(h)})$ only for that stratum. Then the
proof is the same as for Theorem 1.


ttt) 


Theorem  9:  If  in  every  stratum  h,  ^  is  an  unbiased  estimator 


of  ^  thenfay-  /  ^defined  by 


t(h) 


lr  o  .  A  r 

»  "  *  w  y 

is  an  unbiased  estimator  o£  , 

rr  ^  -<*) 

More  generally,  if  y.  !■  an  unbiased  estimator  ofvTi  then 

(V^y  3^  defined  by 
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(22) 


L  /  (U  ,  (h)  J  (M 

z'  /  »  ^  ^  •  :  •  £ 

'‘IJ-  •  •  & '  3t  "  .2 k  V  a/:  ...  A/t 

A-  /  y  A 


‘A 


is  an  unbiased  estimator  of  . 

y  •  *  •  /■ 

Proof:  We  want  to  show  that  A"  /  (  « 

-  (  (/ 

(k) 


In  every  stratum  h,  is  an  unbiased  estimator  of  V^*  so  that 


ik) 

(k)  .  #  (A) 


(L  *,<*>,  ft)  (k))  i  a,Wa/ 

£/(<?.')  I  -gfc  ■  - v:'  UJii 

/  y  *'}  (a-/  v  J  A-/  A[*  aJ 


ft) 


We  have  already  stated  that  the  strata  could  be  regarded  as  independent 

a/WuM 

” populations" .  Thus  *  Wj  is  the  probability  for  the  stratum  h  . 

z4#-  V*'  •  w  j 


7v/V- 
*  y. 


where  E  means  averaging  for  strata.  This  by  definition  is  indeed 


The  extension  to 


.  .  l) it 


is  obvious. 


Note:  The  proof  for  Theorem  8  does  not  use  any  particular  unbiased 
estimator  of  each  stratum. 


/■  M  /V: (h>  (h) 

Theorem  10:  For  <JT.  »  *~™‘  ^  fl.. 

‘J  ni  ■  Hj  ■  J 


and 
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I 


A  (h) 

ra-A 


(k)  ,  rt)  ,  M 

(»  ,■»  >  ' , 
,  w  „<W  "y-  •  l  ■  “d 


defined  by 


,A.  f  /A  »  f 

^r*  £}  ^  “d  ^  ^  fcT?  A{  4...^  •/ 


are  unbiased  estimators  of  7“".  • 

J 


Proof: 


c/,i  .  wr  .  Ofl)  >, 

E(^ij)Ar  EjjL*  Nth-  „.(tof x.(h)  Hij  I 

b  ^.a)Ha)  \y'  H;MN;(k)  a)/(^(A)Y^ 

1*  Kd:  MW n-  u)  ii  j  ppiTT 


where 


indicates  summation  in  the  stratum  h  , 


_^/A)  Afr  i( 


the probability that a unit chosen for each of the pair of samples belongs to
stratum h. To make sure that the same unit $\alpha$ belongs to both samples
in the stratum h, we assign this unit to both samples and choose the remaining
$(n_i^{(h)} - 1)$ and $(n_j^{(h)} - 1)$ units from stratum h for the two samples.


Thus if we define

$$\delta_{\alpha}^{(h)}(i,j) = \begin{cases} 1 & \text{if unit } \alpha \text{ from stratum } h \text{ belongs to samples } S_i \text{ and } S_j, \\ 0 & \text{otherwise.} \end{cases}$$
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I 


■£(%)  - 
•j'sr 


L  ,0>)  ,(k)  u)  /1J 

£  r*  <*>  UtfJ-//(>_J 

frr  M  4-W -v  J  w«n 

U*v  (>v 


where  /  .  is  summation  over  all  the  units  in  U.  and  U. 

tT  1  J 

Thus  £(%).*  ~  ZL$>  ,  ^ 


where 


£  ,... 
«u.j) 


1  if  unit  •<  belongs  to  U.  and 
0  otherwise. 



The  extension  to  t\f~  }  is  obvious. 

'  ij ...  L  >sr 

a  a  .*  *  y* 

Note: $(\hat{\sigma}_{ij})_{st}$ is not the same as $\hat{\sigma}_{ij}$, where $\hat{\sigma}_{ij} = \frac{N_iN_j}{n_in_j}\,n_{ij}$.
The difference is that in $(\hat{\sigma}_{ij})_{st}$ the individual strata receive their correct
weight $\frac{N_i^{(h)}N_j^{(h)}}{N_iN_j}$. It is evident that $(\hat{\sigma}_{ij})_{st}$ coincides with $\hat{\sigma}_{ij}$
provided in each stratum

$$\frac{n_i^{(h)}n_j^{(h)}}{N_i^{(h)}N_j^{(h)}} = \frac{n_in_j}{N_iN_j} = \text{constant}$$

This means that the sampling fraction is the same in all strata. This
stratification is called stratification with proportional allocation of
$n_i^{(h)}$ and $n_j^{(h)}$.
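The per-stratum estimate and the proportional-allocation condition lend themselves to a small check. The sketch below assumes the per-stratum estimator $\frac{N_i^{(h)}N_j^{(h)}}{n_i^{(h)}n_j^{(h)}}\,n_{ij}^{(h)}$ and takes proportional allocation to mean that the product of the two sampling fractions, $\frac{n_i^{(h)}n_j^{(h)}}{N_i^{(h)}N_j^{(h)}}$, is the same in every stratum. Function names and the example figures are hypothetical.

```python
from fractions import Fraction


def stratum_estimate(N_i_h, N_j_h, n_i_h, n_j_h, n_ij_h):
    """Per-stratum estimate of the number of common units: (N_i N_j / n_i n_j) * n_ij."""
    return Fraction(N_i_h * N_j_h, n_i_h * n_j_h) * n_ij_h


def is_proportional_allocation(strata):
    """True when n_i n_j / (N_i N_j) takes the same value in every stratum.

    `strata` is a list of (N_i_h, N_j_h, n_i_h, n_j_h) tuples, one per stratum.
    """
    ratios = {Fraction(n_i * n_j, N_i * N_j) for (N_i, N_j, n_i, n_j) in strata}
    return len(ratios) == 1


# Two hypothetical strata sampled at 10% in each population.
strata = [(100, 200, 10, 20), (50, 80, 5, 8)]
print(is_proportional_allocation(strata))
print(stratum_estimate(100, 200, 10, 20, 3))
```

Exact fractions avoid spurious failures of the equality test that floating-point ratios could cause.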
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Theorem  11:  For  stratified  sampling,  the  variance  of 


as  an 


estimate  of  CJ is 

Q 


Variance 


y  (U;  )  (ty  )_ 


V^sty  k-< 


tit 

Variance  \^lj  ) 


(24) 


In  general, 


Ujl  ,  Cil  Z  <Cj  J, 


Variance^  (  .  .  I  ) J  -- 1  (*)(Nil~W  •Variance  I 

1  1  seJ  Mt.MK.TN}  urtJ 

i  /  l 


(25) 


Proof: 


(it  Lit 


^,-cj 


Li)  &) 

"UXL  _ 

A'/zy 'j  J 


(L) 

<n-i 


(26) 


r  a\  &  a  at  (A>  . 

-  2  A/.  /V/  C  fly  -  c-j  ) 

1  *»  / 

Ni  Nj 

_  /  ,2  /  flO  )2  /  *  N  Jh  f 

s  )  (NJ  )  J 


a»  r-u  M  \fAk)  (i,\  (27) 


f  {  T.  Ni  -*0  Ni  NJ  Wj~*£f) 


A-ftk  “** 
since the error in the estimate is now expressed as a weighted mean of
the errors of estimation that have been made in each individual stratum.


The right hand side of Eq. (23) extends over all pairs of strata. We now
have to average over all possible samples. For any cross product term,
we begin by keeping the sample in stratum h fixed and average over all
samples from stratum k. Since sampling is independent in the two strata,
the possible samples in stratum k will be the same and will have the same
probabilities, whatever sample has been drawn in stratum h. But since
$\hat{\sigma}_{ij}^{(k)}$ is assumed unbiased, the average of
$\left[\hat{\sigma}_{ij}^{(k)} - \sigma_{ij}^{(k)}\right]$ is zero. Hence all
cross product terms vanish. Thus


Variance 


ik) 


L  ,  (k\  (k.)  A  . 

-  Z(Ni  NJ  )  •  Variance  fa.  ) 

A8/  Ml  A//  " 

*  J 


(28] 


The extension to $(\hat{\sigma}_{ij \ldots l})_{st}$ is obvious. Note: we have not used any
particular unbiased estimator in this proof.

J  (A)  (k) 


a«J  mV  «) 

Theorem  12:  When  •  z  ‘  '  J  -.nr 

J  m.N:  lJ 


and 


NfNj 

L  V  (V 


is 


/  (T-  )  =  £  NJ  .  Nt  tk)  ,  the  variance  fe  ) 

L  'J  *  /*/  7**/  xNZ'y  y  J 

1  J  '  J) 


Proof  follows  from  Theorems  2  and  11. 
Appendix D


An  Experimental  Investigation  of  "Clustering" 


INTRODUCTION 

In a previous report under this contract (Co-Author Clusters:
AF Contract 19(626)-10 Quarterly Report III) a technique for "clustering"
on the basis of co-authorship relationships was explored, in which
networks, or clusters, of authors were found and authors were ranked
according to a measure of centrality within each net. This report describes:

A. An experimental investigation of citation indexing
which was recently undertaken as a continuation
and extension of the above work on clustering
techniques.

B. An algorithm for automating the path-tracing
technique used to discover central authors within
co-author networks.

A.  CITATION  INDEX  CLUSTERING 

A citation index is an information retrieval tool which lists, for
each document in the index, those documents that have cited it--its
"descendants". Several large-scale citation indices are now being
made (1) (2) (3). In this study, a variation of the citation index concept,
an author citation index, is being explored, which lists, for each author,
those authors who cite him.
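
The author citation index described above can be pictured as an inverted list keyed by the cited author. The following Python sketch is illustrative only and is not the procedure used in the study; the function name `build_author_citation_index` and the sample (citing, cited) pairs are assumptions made for the example.

```python
from collections import defaultdict


def build_author_citation_index(citations):
    """Map each cited author to the sorted list of authors who cite him.

    `citations` is an iterable of (citing_author, cited_author) pairs.
    Repeated pairs are collapsed, since the index only lists who cites whom.
    """
    index = defaultdict(set)
    for citing, cited in citations:
        index[cited].add(citing)
    return {author: sorted(citers) for author, citers in index.items()}


# Hypothetical sample data, not taken from the report.
sample = [("b", "a"), ("c", "a"), ("c", "b"), ("b", "a")]
index = build_author_citation_index(sample)
print(index)  # {'a': ['b', 'c'], 'b': ['c']}
```

A title citation index could be built the same way by passing (citing title, cited title) pairs instead of author pairs.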

Data  Collected 

A small citation index has been made, based on a bibliography collected
by Charles Bourne of Stanford Research Institute plus all documents
referred to in these articles (in the field of Information Retrieval).
The index consists of approximately 500 original
titles, 1300 cited titles, 213 original authors and 450 cited authors.

Data  Analysis 

1. Preliminary data describing the document and author
samples have been obtained (total number of different
authors and titles in the source and citation decks,
frequency distributions, growth and interconnectedness
of the samples). It was found that 93% of the authors
are cited 5 times or fewer, and 69% are cited only once.
Eight authors were cited 15 times or more (M. Bailey,
C. Bernier, R. Casey, H. Luhn, C. Mooers, J. Perry,
C. Shannon, M. Taube). However, in one sense the
matrix is highly interconnected. Of the 213 authors in
the source deck, 62% (132) are also found in the citation
deck, which gives an indication of the growth of the author
index.

2. An attempt to partition the author citation index in a way
analogous to that used in the co-author study (find all
authors b, c, ... who cite a, then all those who cite
b, c, ..., etc.) is, so far, yielding much larger and more
highly interconnected networks than were obtained with
co-authors. One network found contains approximately
48 authors. Although the matrix of relations is of course
not symmetric, some symmetry was found: 7 of the 48
authors cited each other.
3. Examination of a sample citation tracing (on titles)
made by John Tukey indicated a "chaining" effect in
the citation habits of the authors (a cites b, then c
cites a and b, d cites a, b, and c, etc.). This chaining
effect will have to be considered in any attempt to
obtain "key" authors by some frequency-based criterion.
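
The partitioning step in item 2 (find all authors who cite a, then all who cite those authors, and so on) can be sketched as a breadth-first trace over the "is cited by" lists. This is a minimal sketch under assumed data, not the study's program; the names `trace_citers` and `mutual_pairs` and the sample index are hypothetical.

```python
from collections import deque


def trace_citers(index, start):
    """Return every author reachable from `start` by following
    "is cited by" links, including `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        author = queue.popleft()
        for citer in index.get(author, ()):
            if citer not in seen:
                seen.add(citer)
                queue.append(citer)
    return seen


def mutual_pairs(index):
    """Return the unordered pairs of distinct authors who cite each other."""
    pairs = set()
    for cited, citers in index.items():
        for citer in citers:
            if citer != cited and cited in index.get(citer, ()):
                pairs.add(frozenset((cited, citer)))
    return pairs


# Hypothetical index: each key is cited by the authors in its list.
index = {"a": ["b", "c"], "b": ["a", "d"], "d": ["e"], "x": ["y"]}
network = trace_citers(index, "a")
print(sorted(network))  # ['a', 'b', 'c', 'd', 'e']
print(mutual_pairs(index) == {frozenset(("a", "b"))})  # True
```

Because the relation is not symmetric, the network found depends on the starting author; the mutual pairs correspond to the symmetric entries noted in item 2.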

B. PATH-TRACING ALGORITHM FOR RANKING CO-AUTHORS


In the paper on co-author clusters, a manual path-tracing
technique for finding "central" co-authors in a communication network
of co-authors was proposed. Here, an algorithm suitable for computer
use is described to replace the manual technique, and a rough
estimate of computer time is calculated.

In the original paper a network of co-authors was partitioned
into sub-networks. Then a sub-network, e.g.,

[figure: example sub-network graph]

was drawn in which the nodes represent authors, and the links, the
symmetric relation "co-authored with". From this graph, a matrix
was obtained and "central" authors derived as follows:

Consider author i.

Let P_{ij} = the minimum path (in terms of number of links)
between i and j.

Let d_i = \sum_{j \ne i} P_{ij} = cumulative "distance"
between i and all other authors in the network.

Then let d_i(min) be a measure of "centrality" for i. (The "central"
author is the one for whom d_i is minimum.)
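
The manual definition above amounts to summing shortest-path lengths from each author to every other author. The following is a minimal sketch, assuming the sub-network is connected and is held as adjacency lists; the function name `cumulative_distances` and the four-author network are hypothetical and serve only to illustrate the definition.

```python
from collections import deque


def cumulative_distances(adjacency):
    """For each author i, sum the minimum path lengths (in links)
    from i to every other author in a connected sub-network."""
    totals = {}
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        totals[source] = sum(dist.values())
    return totals


# Hypothetical sub-network with co-author links 1-2, 1-3 and 2-4.
network = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
d = cumulative_distances(network)
central = [i for i in d if d[i] == min(d.values())]
print(d)        # {1: 4, 2: 4, 3: 6, 4: 6}
print(central)  # [1, 2]
```

Ties are possible, as here, so the sketch returns every author who attains the minimum rather than a single "central" author.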
The following algorithm assumes that the network has been
partitioned into sub-networks and describes the path-tracing and
summation procedure for finding the "central" authors within each
sub-network. To automate, we make use of the following known
theorem:


In A^k the entry a_{ij}^{(k)} = c if and only if there are c distinct
k-chains from i to j (where A is a matrix with elements a_{ij}, A^k is
the k-th power of A, and a k-chain is a chain with k links).

Therefore, we proceed as follows:

Consider the author i:

Let a_{ij} = 1 if i and j are co-authors,
             0 if not.

Let k_{ij} be the smallest value of k such that a_{ij}^{(k)} > 0.

Then d_i = \sum_{j \ne i} k_{ij}.


(E.g., considering the first vector a_{1j}, look for all non-zero
entries. Count the number of non-zero entries. If all entries are
non-zero, proceed to the next vector. If some zero entries are found,
take the square of the matrix A. Looking only for those entries
a_{1j} which were zero in the preceding matrix, count all entries
a_{1j}^{(2)} which are now non-zero. Multiply the number of such entries
by 2 (the power of the matrix). Add the number obtained to the previous
sum. Repeat the procedure by taking successive powers of the matrix
until all entries a_{1j} have been accounted for. Then rank the d_i as in
the manual technique.) For example, given the network:

[figure: example network of four authors]
The manual technique yields the cumulative distances d_i:

d_1 = 4
d_2 = 4
d_3 = 6
d_4 = 6

To automate, we obtain the symmetric matrix A:

        0 1 1 0
    A = 1 0 0 1
        1 0 0 0
        0 1 0 0

a_{12} and a_{13} are non-zero; therefore, we get the number of such
entries (2) times the power of the matrix (1), yielding the partial sum
P_1 = (2) x (1) = 2

The square of the above matrix is

          2 0 0 1
    A^2 = 0 2 1 0
          0 1 1 0
          1 0 0 1

Since a_{14} is the only entry in the vector a_{1j} not previously
encountered, we look for it. Here it is non-zero, so, counting the
number of such entries (1) and multiplying by the power of the matrix
(2), we get (1) x (2) = 2, which added to the partial sum P_1 gives the
new partial sum P_1 = 4.

Since all entries in this vector have been accounted for (a_{11} is not
considered) we need not take further powers of the matrix, and
P_1 = 4 = d_1, the cumulative distance obtained with the manual
technique. The procedure is then repeated for all other vectors.
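
The worked example can be checked with a short sketch of the matrix-power procedure. The adjacency matrix below is an assumption: it is the four-author network with links 1-2, 1-3 and 2-4, which reproduces the cumulative distances 4, 4, 6, 6 quoted above. The function names are illustrative, and the loop stops after n-1 powers so that a disconnected network cannot run indefinitely.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def cumulative_distances_by_powers(adj):
    """Return d_i for every row i by scanning successive powers of adj.

    An entry (i, j) first becomes non-zero in the k-th power when the
    minimum path from i to j has k links; k is then added to d_i.
    Diagonal entries are not considered.
    """
    n = len(adj)
    totals = [0] * n
    found = [[i == j for j in range(n)] for i in range(n)]
    remaining = n * (n - 1)
    power, k = adj, 1
    while remaining and k < n:
        for i in range(n):
            for j in range(n):
                if not found[i][j] and power[i][j] > 0:
                    found[i][j] = True
                    totals[i] += k
                    remaining -= 1
        if remaining:
            power = matmul(power, adj)
            k += 1
    return totals


# Assumed example matrix (authors 1..4 map to rows 0..3).
A = [[0, 1, 1, 0],
     [1, 0, 0, 1],
     [1, 0, 0, 0],
     [0, 1, 0, 0]]
print(matmul(A, A)[0])                    # [2, 0, 0, 1]
print(cumulative_distances_by_powers(A))  # [4, 4, 6, 6]
```

The sketch scans every row at each power rather than finishing one vector before starting the next; the sums obtained are the same, and each power of the matrix is computed only once.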

For a rough order-of-magnitude estimate of computing time,
assume the data is stored with one cell to a computer word. ("Packing"
36 a_{ij} per computer word may be more feasible for larger matrices.)
Then if

n = order of the matrix

k = highest power of the matrix needed to find
all distances d_i

the number of multiplications, m, required is at most

m = n^3 (k-1)

(to obtain one element of the new matrix requires n
multiplications; there are n x n elements; the process
is repeated (k-1) times).

For the largest network found in the co-author experiment

n = 12

k = 4

m = 3 x 12^3 ≈ 5 x 10^3 multiplications

Assuming an average of approximately 10 cycles/multiplication, this is
5 x 10^4 cycles, which at 2 μs/cycle = .1 sec. (IBM 7090)

Letting the additions increase the multiplication time by 1/5 (addition
takes 2 cycles) gives a total of approximately .12 seconds. And
doubling the above figure to allow for other programming instructions
gives .24 seconds per network.

Since 23 networks were found in the computer experiment (none
as large as above) computing time should not be more than 6 seconds.
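
The arithmetic of this estimate can be reproduced in a few lines. This is only a restatement of the figures quoted above (10 cycles per multiplication, 2 μs per cycle, a 1/5 allowance for additions, and a doubling for other instructions); the function name and parameter names are assumptions made for the sketch.

```python
def estimate_seconds(n, k, cycles_per_mult=10, cycle_time_us=2.0,
                     addition_overhead=0.2, instruction_factor=2.0):
    """Return (multiplications, seconds) for finding all d_i in one network.

    n is the order of the matrix and k the highest power needed.
    """
    multiplications = (k - 1) * n ** 3
    mult_seconds = multiplications * cycles_per_mult * cycle_time_us * 1e-6
    total = mult_seconds * (1 + addition_overhead) * instruction_factor
    return multiplications, total


m, seconds = estimate_seconds(n=12, k=4)
print(m, round(seconds, 2), round(23 * seconds, 1))  # 5184 0.25 5.7
```

Without the intermediate rounding used in the text the per-network figure is about .25 seconds rather than .24, and 23 networks of the largest size would still take less than 6 seconds.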
As a result of work on both co-author and citation patterns,
the utility of a general "path tracing" computer program became clear.
Such a program would be applicable to citation index (title citation
index and/or author citation index) data, as well as to co-author data,
and would have the following functions*:

a. Give the distance between any pair of nodes.
Distance to be defined either as minimum path, as
in the co-author study, or as a defined distance,
such as, for the citation authors,
f(m, n, o) where

m = number of times a cites b

n = number of authors by whom b is cited

o = number of authors a cites.

b. Given a distance, find, for any chosen node (person,
title), all nodes within that distance.

c. Given a node (e.g., title in a citation index), limit
the path tracing by some combination of words within
the node (e.g., given a title, select only those titles
which contain words a and b but not c; trace only those
paths emanating from the selected nodes ...).


* In the following description, the "nodes" and "links" would be:

         Author Citation Index    Title Citation Index    Co-Authors

Nodes    authors                  titles                  authors

Links    "is cited by"            "is cited by"           "co-authored with"
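
Functions b and c of the proposed program can be sketched as a distance-limited trace with an optional word filter on the nodes. This is a hedged illustration, not a design from the report: the names `nodes_within` and `has_words`, the sample titles, and the choice of minimum path as the distance are all assumptions.

```python
from collections import deque


def has_words(title, required, excluded):
    """True if the title contains every required word and no excluded word."""
    words = set(title.lower().split())
    return required <= words and not (excluded & words)


def nodes_within(links, start, max_distance, keep=lambda node: True):
    """Return {node: distance} for nodes within `max_distance` links of
    `start`, tracing only through nodes accepted by `keep`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_distance:
            continue  # do not trace beyond the requested distance
        for nxt in links.get(node, ()):
            if nxt not in dist and keep(nxt):
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Hypothetical title citation index: each title is cited by those listed.
links = {
    "indexing by machine": ["machine indexing review", "coding theory"],
    "machine indexing review": ["survey of machine indexing"],
    "coding theory": ["noise"],
}


def keep(title):
    return has_words(title, {"machine", "indexing"}, {"survey"})


print(nodes_within(links, "indexing by machine", 1))
# {'indexing by machine': 0, 'machine indexing review': 1, 'coding theory': 1}
print(nodes_within(links, "indexing by machine", 2, keep))
# {'indexing by machine': 0, 'machine indexing review': 1}
```

A defined distance such as f(m, n, o) would require weighted links and a different search than the breadth-first trace used here.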